scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation
scikit-learn
Machine Learning in Python
Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
Classification
Identifying which category an object belongs to.
Applications: Spam detection, image recognition.
Algorithms: Gradient boosting, nearest neighbors, random forest, logistic regression, and more...
Regression
Predicting a continuous-valued attribute associated with an object.
Applications: Drug response, Stock prices.
Algorithms: Gradient boosting, nearest neighbors, random forest, ridge, and more...
Clustering
Automatic grouping of similar objects into sets.
Applications: Customer segmentation, Grouping experiment outcomes
Algorithms: k-Means, HDBSCAN, hierarchical clustering, and more...
Dimensionality reduction
Reducing the number of random variables to consider.
Applications: Visualization, Increased efficiency
Algorithms: PCA, feature selection, non-negative matrix factorization, and more...
Model selection
Comparing, validating and choosing parameters and models.
Applications: Improved accuracy via parameter tuning
Algorithms: grid search, cross validation, metrics, and more...
Preprocessing
Feature extraction and normalization.
Applications: Transforming input data such as text for use with machine learning algorithms.
Algorithms: preprocessing, feature extraction, and more...
News
On-going development:
scikit-learn 1.5 (Changelog)
February 2024. scikit-learn 1.4.1.post1 is available for download (Changelog).
January 2024. scikit-learn 1.4.0 is available for download (Changelog).
October 2023. scikit-learn 1.3.2 is available for download (Changelog).
September 2023. scikit-learn 1.3.1 is available for download (Changelog).
June 2023. scikit-learn 1.3.0 is available for download (Changelog).
All releases:
What's new (Changelog)
Community
About us: See authors and contributing
More Machine Learning: Find related projects
Questions? See FAQ and stackoverflow
Subscribe to the mailing list
Gitter: gitter.im/scikit-learn
Blog: blog.scikit-learn.org
Logos & Branding: logos and branding
Calendar: calendar
Twitter: @scikit_learn
LinkedIn: linkedin/scikit-learn
YouTube: youtube.com/scikit-learn
Facebook: @scikitlearnofficial
Instagram: @scikitlearnofficial
TikTok: @scikit.learn
Communication on all channels should respect PSF's code of conduct.
Help us, donate!
Cite us!
Who uses scikit-learn?
"We use scikit-learn to support leading-edge basic research [...]"
"I think it's the most well-designed ML package I've seen so far."
"scikit-learn's ease-of-use, performance and overall variety of algorithms implemented has proved invaluable [...]."
"The great benefit of scikit-learn is its fast learning curve [...]"
"It allows us to do AWesome stuff we would not otherwise accomplish"
"scikit-learn makes doing advanced analysis in Python accessible to anyone."
More testimonials
scikit-learn development and maintenance are financially supported by
User guide: contents — scikit-learn 1.4.1 documentation
User Guide
1. Supervised learning
1.1. Linear Models
1.1.1. Ordinary Least Squares
1.1.2. Ridge regression and classification
1.1.3. Lasso
1.1.4. Multi-task Lasso
1.1.5. Elastic-Net
1.1.6. Multi-task Elastic-Net
1.1.7. Least Angle Regression
1.1.8. LARS Lasso
1.1.9. Orthogonal Matching Pursuit (OMP)
1.1.10. Bayesian Regression
1.1.11. Logistic regression
1.1.12. Generalized Linear Models
1.1.13. Stochastic Gradient Descent - SGD
1.1.14. Perceptron
1.1.15. Passive Aggressive Algorithms
1.1.16. Robustness regression: outliers and modeling errors
1.1.17. Quantile Regression
1.1.18. Polynomial regression: extending linear models with basis functions
1.2. Linear and Quadratic Discriminant Analysis
1.2.1. Dimensionality reduction using Linear Discriminant Analysis
1.2.2. Mathematical formulation of the LDA and QDA classifiers
1.2.3. Mathematical formulation of LDA dimensionality reduction
1.2.4. Shrinkage and Covariance Estimator
1.2.5. Estimation algorithms
1.3. Kernel ridge regression
1.4. Support Vector Machines
1.4.1. Classification
1.4.2. Regression
1.4.3. Density estimation, novelty detection
1.4.4. Complexity
1.4.5. Tips on Practical Use
1.4.6. Kernel functions
1.4.7. Mathematical formulation
1.4.8. Implementation details
1.5. Stochastic Gradient Descent
1.5.1. Classification
1.5.2. Regression
1.5.3. Online One-Class SVM
1.5.4. Stochastic Gradient Descent for sparse data
1.5.5. Complexity
1.5.6. Stopping criterion
1.5.7. Tips on Practical Use
1.5.8. Mathematical formulation
1.5.9. Implementation details
1.6. Nearest Neighbors
1.6.1. Unsupervised Nearest Neighbors
1.6.2. Nearest Neighbors Classification
1.6.3. Nearest Neighbors Regression
1.6.4. Nearest Neighbor Algorithms
1.6.5. Nearest Centroid Classifier
1.6.6. Nearest Neighbors Transformer
1.6.7. Neighborhood Components Analysis
1.7. Gaussian Processes
1.7.1. Gaussian Process Regression (GPR)
1.7.2. Gaussian Process Classification (GPC)
1.7.3. GPC examples
1.7.4. Kernels for Gaussian Processes
1.8. Cross decomposition
1.8.1. PLSCanonical
1.8.2. PLSSVD
1.8.3. PLSRegression
1.8.4. Canonical Correlation Analysis
1.9. Naive Bayes
1.9.1. Gaussian Naive Bayes
1.9.2. Multinomial Naive Bayes
1.9.3. Complement Naive Bayes
1.9.4. Bernoulli Naive Bayes
1.9.5. Categorical Naive Bayes
1.9.6. Out-of-core naive Bayes model fitting
1.10. Decision Trees
1.10.1. Classification
1.10.2. Regression
1.10.3. Multi-output problems
1.10.4. Complexity
1.10.5. Tips on practical use
1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART
1.10.7. Mathematical formulation
1.10.8. Missing Values Support
1.10.9. Minimal Cost-Complexity Pruning
1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking
1.11.1. Gradient-boosted trees
1.11.2. Random forests and other randomized tree ensembles
1.11.3. Bagging meta-estimator
1.11.4. Voting Classifier
1.11.5. Voting Regressor
1.11.6. Stacked generalization
1.11.7. AdaBoost
1.12. Multiclass and multioutput algorithms
1.12.1. Multiclass classification
1.12.2. Multilabel classification
1.12.3. Multiclass-multioutput classification
1.12.4. Multioutput regression
1.13. Feature selection
1.13.1. Removing features with low variance
1.13.2. Univariate feature selection
1.13.3. Recursive feature elimination
1.13.4. Feature selection using SelectFromModel
1.13.5. Sequential Feature Selection
1.13.6. Feature selection as part of a pipeline
1.14. Semi-supervised learning
1.14.1. Self Training
1.14.2. Label Propagation
1.15. Isotonic regression
1.16. Probability calibration
1.16.1. Calibration curves
1.16.2. Calibrating a classifier
1.16.3. Usage
1.17. Neural network models (supervised)
1.17.1. Multi-layer Perceptron
1.17.2. Classification
1.17.3. Regression
1.17.4. Regularization
1.17.5. Algorithms
1.17.6. Complexity
1.17.7. Mathematical formulation
1.17.8. Tips on Practical Use
1.17.9. More control with warm_start
2. Unsupervised learning
2.1. Gaussian mixture models
2.1.1. Gaussian Mixture
2.1.2. Variational Bayesian Gaussian Mixture
2.2. Manifold learning
2.2.1. Introduction
2.2.2. Isomap
2.2.3. Locally Linear Embedding
2.2.4. Modified Locally Linear Embedding
2.2.5. Hessian Eigenmapping
2.2.6. Spectral Embedding
2.2.7. Local Tangent Space Alignment
2.2.8. Multi-dimensional Scaling (MDS)
2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)
2.2.10. Tips on practical use
2.3. Clustering
2.3.1. Overview of clustering methods
2.3.2. K-means
2.3.3. Affinity Propagation
2.3.4. Mean Shift
2.3.5. Spectral clustering
2.3.6. Hierarchical clustering
2.3.7. DBSCAN
2.3.8. HDBSCAN
2.3.9. OPTICS
2.3.10. BIRCH
2.3.11. Clustering performance evaluation
2.4. Biclustering
2.4.1. Spectral Co-Clustering
2.4.2. Spectral Biclustering
2.4.3. Biclustering evaluation
2.5. Decomposing signals in components (matrix factorization problems)
2.5.1. Principal component analysis (PCA)
2.5.2. Kernel Principal Component Analysis (kPCA)
2.5.3. Truncated singular value decomposition and latent semantic analysis
2.5.4. Dictionary Learning
2.5.5. Factor Analysis
2.5.6. Independent component analysis (ICA)
2.5.7. Non-negative matrix factorization (NMF or NNMF)
2.5.8. Latent Dirichlet Allocation (LDA)
2.6. Covariance estimation
2.6.1. Empirical covariance
2.6.2. Shrunk Covariance
2.6.3. Sparse inverse covariance
2.6.4. Robust Covariance Estimation
2.7. Novelty and Outlier Detection
2.7.1. Overview of outlier detection methods
2.7.2. Novelty Detection
2.7.3. Outlier Detection
2.7.4. Novelty detection with Local Outlier Factor
2.8. Density Estimation
2.8.1. Density Estimation: Histograms
2.8.2. Kernel Density Estimation
2.9. Neural network models (unsupervised)
2.9.1. Restricted Boltzmann machines
3. Model selection and evaluation
3.1. Cross-validation: evaluating estimator performance
3.1.1. Computing cross-validated metrics
3.1.2. Cross validation iterators
3.1.3. A note on shuffling
3.1.4. Cross validation and model selection
3.1.5. Permutation test score
3.2. Tuning the hyper-parameters of an estimator
3.2.1. Exhaustive Grid Search
3.2.2. Randomized Parameter Optimization
3.2.3. Searching for optimal parameters with successive halving
3.2.4. Tips for parameter search
3.2.5. Alternatives to brute force parameter search
3.3. Metrics and scoring: quantifying the quality of predictions
3.3.1. The scoring parameter: defining model evaluation rules
3.3.2. Classification metrics
3.3.3. Multilabel ranking metrics
3.3.4. Regression metrics
3.3.5. Clustering metrics
3.3.6. Dummy estimators
3.4. Validation curves: plotting scores to evaluate models
3.4.1. Validation curve
3.4.2. Learning curve
4. Inspection
4.1. Partial Dependence and Individual Conditional Expectation plots
4.1.1. Partial dependence plots
4.1.2. Individual conditional expectation (ICE) plot
4.1.3. Mathematical Definition
4.1.4. Computation methods
4.2. Permutation feature importance
4.2.1. Outline of the permutation importance algorithm
4.2.2. Relation to impurity-based importance in trees
4.2.3. Misleading values on strongly correlated features
5. Visualizations
5.1. Available Plotting Utilities
5.1.1. Display Objects
6. Dataset transformations
6.1. Pipelines and composite estimators
6.1.1. Pipeline: chaining estimators
6.1.2. Transforming target in regression
6.1.3. FeatureUnion: composite feature spaces
6.1.4. ColumnTransformer for heterogeneous data
6.1.5. Visualizing Composite Estimators
6.2. Feature extraction
6.2.1. Loading features from dicts
6.2.2. Feature hashing
6.2.3. Text feature extraction
6.2.4. Image feature extraction
6.3. Preprocessing data
6.3.1. Standardization, or mean removal and variance scaling
6.3.2. Non-linear transformation
6.3.3. Normalization
6.3.4. Encoding categorical features
6.3.5. Discretization
6.3.6. Imputation of missing values
6.3.7. Generating polynomial features
6.3.8. Custom transformers
6.4. Imputation of missing values
6.4.1. Univariate vs. Multivariate Imputation
6.4.2. Univariate feature imputation
6.4.3. Multivariate feature imputation
6.4.4. Nearest neighbors imputation
6.4.5. Keeping the number of features constant
6.4.6. Marking imputed values
6.4.7. Estimators that handle NaN values
6.5. Unsupervised dimensionality reduction
6.5.1. PCA: principal component analysis
6.5.2. Random projections
6.5.3. Feature agglomeration
6.6. Random Projection
6.6.1. The Johnson-Lindenstrauss lemma
6.6.2. Gaussian random projection
6.6.3. Sparse random projection
6.6.4. Inverse Transform
6.7. Kernel Approximation
6.7.1. Nystroem Method for Kernel Approximation
6.7.2. Radial Basis Function Kernel
6.7.3. Additive Chi Squared Kernel
6.7.4. Skewed Chi Squared Kernel
6.7.5. Polynomial Kernel Approximation via Tensor Sketch
6.7.6. Mathematical Details
6.8. Pairwise metrics, Affinities and Kernels
6.8.1. Cosine similarity
6.8.2. Linear kernel
6.8.3. Polynomial kernel
6.8.4. Sigmoid kernel
6.8.5. RBF kernel
6.8.6. Laplacian kernel
6.8.7. Chi-squared kernel
6.9. Transforming the prediction target (y)
6.9.1. Label binarization
6.9.2. Label encoding
7. Dataset loading utilities
7.1. Toy datasets
7.1.1. Iris plants dataset
7.1.2. Diabetes dataset
7.1.3. Optical recognition of handwritten digits dataset
7.1.4. Linnerrud dataset
7.1.5. Wine recognition dataset
7.1.6. Breast cancer wisconsin (diagnostic) dataset
7.2. Real world datasets
7.2.1. The Olivetti faces dataset
7.2.2. The 20 newsgroups text dataset
7.2.3. The Labeled Faces in the Wild face recognition dataset
7.2.4. Forest covertypes
7.2.5. RCV1 dataset
7.2.6. Kddcup 99 dataset
7.2.7. California Housing dataset
7.2.8. Species distribution dataset
7.3. Generated datasets
7.3.1. Generators for classification and clustering
7.3.2. Generators for regression
7.3.3. Generators for manifold learning
7.3.4. Generators for decomposition
7.4. Loading other datasets
7.4.1. Sample images
7.4.2. Datasets in svmlight / libsvm format
7.4.3. Downloading datasets from the openml.org repository
7.4.4. Loading from external datasets
8. Computing with scikit-learn
8.1. Strategies to scale computationally: bigger data
8.1.1. Scaling with instances using out-of-core learning
8.2. Computational Performance
8.2.1. Prediction Latency
8.2.2. Prediction Throughput
8.2.3. Tips and Tricks
8.3. Parallelism, resource management, and configuration
8.3.1. Parallelism
8.3.2. Configuration switches
9. Model persistence
9.1. Python specific serialization
9.1.1. Security & maintainability limitations
9.1.2. A more secure format: skops
9.2. Interoperable formats
10. Common pitfalls and recommended practices
10.1. Inconsistent preprocessing
10.2. Data leakage
10.2.1. How to avoid data leakage
10.2.2. Data leakage during pre-processing
10.3. Controlling randomness
10.3.1. Using None or RandomState instances, and repeated calls to fit and split
10.3.2. Common pitfalls and subtleties
10.3.3. General recommendations
11. Dispatching
11.1. Array API support (experimental)
11.1.1. Example usage
11.1.2. Support for Array API-compatible inputs
11.1.3. Common estimator checks
Under Development
1. Metadata Routing
© 2007 - 2024, scikit-learn developers (BSD License).
sklearn.svm.SVC — scikit-learn 1.4.1 documentation
sklearn.svm.SVC
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
C-Support Vector Classification.
The implementation is based on libsvm. The fit time scales at least
quadratically with the number of samples and may be impractical
beyond tens of thousands of samples. For large datasets
consider using LinearSVC or
SGDClassifier instead, possibly after a
Nystroem transformer or
other Kernel Approximation.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided
kernel functions and how gamma, coef0 and degree affect each
other, see the corresponding section in the narrative documentation:
Kernel functions.
To learn how to tune SVC’s hyperparameters, see the following example:
Nested versus non-nested cross-validation
Read more in the User Guide.
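The advice above for large datasets (a linear model fitted after a kernel approximation) can be sketched as follows. This is only an illustration under assumptions: the synthetic data, `n_components=100`, and the choice of `SGDClassifier` with a hinge loss are arbitrary picks, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Approximate an RBF kernel feature map with Nystroem, then fit a linear
# classifier (hinge loss, i.e. a linear SVM objective) on the mapped features.
approx_svm = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0),
)
approx_svm.fit(X, y)

train_accuracy = approx_svm.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```

The cost of this pipeline grows roughly linearly with the number of samples, whereas the fit time of SVC grows at least quadratically.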
Parameters:
C : float, default=1.0
    Regularization parameter. The strength of the regularization is
    inversely proportional to C. Must be strictly positive. The penalty
    is a squared l2 penalty.
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
    Specifies the kernel type to be used in the algorithm. If
    none is given, 'rbf' will be used. If a callable is given, it is used to
    pre-compute the kernel matrix from data matrices; that matrix should be
    an array of shape (n_samples, n_samples). For an intuitive
    visualization of different kernel types see
    Plot classification boundaries with different SVM Kernels.
degree : int, default=3
    Degree of the polynomial kernel function ('poly').
    Must be non-negative. Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale'
    Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
    If gamma='scale' (default) is passed, 1 / (n_features * X.var()) is used as the value of gamma;
    if 'auto', 1 / n_features is used;
    if a float, it must be non-negative.
    Changed in version 0.22: The default value of gamma changed from 'auto' to 'scale'.
coef0 : float, default=0.0
    Independent term in kernel function.
    It is only significant in 'poly' and 'sigmoid'.
shrinking : bool, default=True
    Whether to use the shrinking heuristic.
    See the User Guide.
probability : bool, default=False
    Whether to enable probability estimates. This must be enabled prior
    to calling fit, will slow down that method as it internally uses
    5-fold cross-validation, and predict_proba may be inconsistent with
    predict. Read more in the User Guide.
tol : float, default=1e-3
    Tolerance for stopping criterion.
cache_size : float, default=200
    Specify the size of the kernel cache (in MB).
class_weight : dict or 'balanced', default=None
    Set the parameter C of class i to class_weight[i]*C for
    SVC. If not given, all classes are assumed to have
    weight one.
    The 'balanced' mode uses the values of y to automatically adjust
    weights inversely proportional to class frequencies in the input data
    as n_samples / (n_classes * np.bincount(y)).
verbose : bool, default=False
    Enable verbose output. Note that this setting takes advantage of a
    per-process runtime setting in libsvm that, if enabled, may not work
    properly in a multithreaded context.
max_iter : int, default=-1
    Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape : {'ovo', 'ovr'}, default='ovr'
    Whether to return a one-vs-rest ('ovr') decision function of shape
    (n_samples, n_classes) as all other classifiers, or the original
    one-vs-one ('ovo') decision function of libsvm which has shape
    (n_samples, n_classes * (n_classes - 1) / 2). However, note that
    internally, one-vs-one ('ovo') is always used as a multi-class strategy
    to train models; an ovr matrix is only constructed from the ovo matrix.
    The parameter is ignored for binary classification.
    Changed in version 0.19: decision_function_shape is 'ovr' by default.
    New in version 0.17: decision_function_shape='ovr' is recommended.
    Changed in version 0.17: Deprecated decision_function_shape='ovo' and None.
break_ties : bool, default=False
    If true, decision_function_shape='ovr', and number of classes > 2,
    predict will break ties according to the confidence values of
    decision_function; otherwise the first class among the tied
    classes is returned. Please note that breaking ties comes at a
    relatively high computational cost compared to a simple predict.
    New in version 0.22.
random_state : int, RandomState instance or None, default=None
    Controls the pseudo random number generation for shuffling the data for
    probability estimates. Ignored when probability is False.
    Pass an int for reproducible output across multiple function calls.
    See Glossary.
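The two formulas quoted in the parameter list, for gamma='scale' and for class_weight='balanced', can be checked numerically. The sketch below assumes random, deliberately imbalanced toy data; the variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = np.array([0] * 30 + [1] * 10)  # imbalanced toy labels

# gamma='scale' is documented as 1 / (n_features * X.var()).
gamma_scale = 1.0 / (X.shape[1] * X.var())
clf_scale = SVC(gamma="scale").fit(X, y)
clf_manual = SVC(gamma=gamma_scale).fit(X, y)
same_decision = np.allclose(
    clf_scale.decision_function(X), clf_manual.decision_function(X)
)

# class_weight='balanced' is documented as
# n_samples / (n_classes * np.bincount(y)).
n_classes = len(np.unique(y))
expected_weights = len(y) / (n_classes * np.bincount(y))
clf_balanced = SVC(class_weight="balanced").fit(X, y)

print(same_decision)
print(expected_weights, clf_balanced.class_weight_)
```

Passing the manually computed gamma gives the same decision values as gamma='scale', and the fitted class_weight_ attribute matches the bincount formula.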
Attributes:
class_weight_ : ndarray of shape (n_classes,)
    Multipliers of parameter C for each class.
    Computed based on the class_weight parameter.
classes_ : ndarray of shape (n_classes,)
    The class labels.
coef_ : ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
    Weights assigned to the features when kernel="linear".
dual_coef_ : ndarray of shape (n_classes - 1, n_SV)
    Dual coefficients of the support vector in the decision
    function (see Mathematical formulation), multiplied by
    their targets.
    For multiclass, coefficient for all 1-vs-1 classifiers.
    The layout of the coefficients in the multiclass case is somewhat
    non-trivial. See the multi-class section of the User Guide for details.
fit_status_ : int
    0 if correctly fitted, 1 otherwise (will raise warning).
intercept_ : ndarray of shape (n_classes * (n_classes - 1) / 2,)
    Constants in decision function.
n_features_in_ : int
    Number of features seen during fit.
    New in version 0.24.
feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Defined only when X
    has feature names that are all strings.
    New in version 1.0.
n_iter_ : ndarray of shape (n_classes * (n_classes - 1) // 2,)
    Number of iterations run by the optimization routine to fit the model.
    The shape of this attribute depends on the number of models optimized
    which in turn depends on the number of classes.
    New in version 1.1.
support_ : ndarray of shape (n_SV,)
    Indices of support vectors.
support_vectors_ : ndarray of shape (n_SV, n_features)
    Support vectors. An empty array if kernel is precomputed.
n_support_ : ndarray of shape (n_classes,), dtype=int32
    Number of support vectors for each class.
probA_ : ndarray of shape (n_classes * (n_classes - 1) / 2,)
    Parameter learned in Platt scaling when probability=True.
probB_ : ndarray of shape (n_classes * (n_classes - 1) / 2,)
    Parameter learned in Platt scaling when probability=True.
shape_fit_ : tuple of int of shape (n_dimensions_of_X,)
    Array dimensions of training vector X.
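The attribute shapes listed above are easier to read on a concrete fit. The sketch below uses the iris dataset (3 classes, 4 features) with a linear kernel so that coef_ is defined; the dataset choice is only an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features
clf = SVC(kernel="linear").fit(X, y)

n_classes = len(clf.classes_)
n_pairs = n_classes * (n_classes - 1) // 2  # number of one-vs-one models
n_sv = clf.support_.shape[0]                # total number of support vectors

print(clf.coef_.shape)             # (n_pairs, n_features)
print(clf.intercept_.shape)        # (n_pairs,)
print(clf.dual_coef_.shape)        # (n_classes - 1, n_sv)
print(clf.support_vectors_.shape)  # (n_sv, n_features)
print(clf.n_support_)              # support vectors per class, sums to n_sv
```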
See also
SVR : Support Vector Machine for Regression implemented using libsvm.
LinearSVC : Scalable Linear Support Vector Machine for classification implemented using liblinear. Check the See Also section of LinearSVC for further comparison.
References
[1] LIBSVM: A Library for Support Vector Machines
[2] Platt, John (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods"
Examples
>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
>>> clf.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
('svc', SVC(gamma='auto'))])
>>> print(clf.predict([[-0.8, -1]]))
[1]
Methods
decision_function(X) : Evaluate the decision function for the samples in X.
fit(X, y[, sample_weight]) : Fit the SVM model according to the given training data.
get_metadata_routing() : Get metadata routing of this object.
get_params([deep]) : Get parameters for this estimator.
predict(X) : Perform classification on samples in X.
predict_log_proba(X) : Compute log probabilities of possible outcomes for samples in X.
predict_proba(X) : Compute probabilities of possible outcomes for samples in X.
score(X, y[, sample_weight]) : Return the mean accuracy on the given test data and labels.
set_fit_request(*[, sample_weight]) : Request metadata passed to the fit method.
set_params(**params) : Set the parameters of this estimator.
set_score_request(*[, sample_weight]) : Request metadata passed to the score method.
property coef_
Weights assigned to the features when kernel="linear".
Returns:
ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
decision_function(X)
Evaluate the decision function for the samples in X.
Parameters:
X : array-like of shape (n_samples, n_features)
    The input samples.
Returns:
X : ndarray of shape (n_samples, n_classes * (n_classes - 1) / 2)
    Returns the decision function of the sample for each class
    in the model.
    If decision_function_shape='ovr', the shape is (n_samples,
    n_classes).
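The two return shapes can be compared on a small four-class problem. This is a sketch on synthetic blobs (the data and sizes are assumptions for illustration); with 4 classes there are 4 * 3 / 2 = 6 one-vs-one models.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four well-separated clusters, one per class.
X, y = make_blobs(n_samples=80, centers=4, random_state=0)

ovr = SVC(decision_function_shape="ovr").fit(X, y)
ovo = SVC(decision_function_shape="ovo").fit(X, y)

ovr_shape = ovr.decision_function(X).shape  # (80, 4): one column per class
ovo_shape = ovo.decision_function(X).shape  # (80, 6): one column per class pair

# Training is one-vs-one in both cases, so predictions agree.
same_predictions = bool((ovr.predict(X) == ovo.predict(X)).all())
print(ovr_shape, ovo_shape, same_predictions)
```

Only the layout of the decision values changes between the two settings; the underlying models are the same.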
Notes
If decision_function_shape='ovo', the function values are proportional
to the distance of the samples X to the separating hyperplane. If the
exact distances are required, divide the function values by the norm of
the weight vector (coef_). See also this question for further details.
If decision_function_shape='ovr', the decision function is a monotonic
transformation of the ovo decision function.
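For a binary problem with a linear kernel, the conversion from decision values to signed distances described in the notes can be sketched as below. The four training points are an arbitrary toy set, and the cross-check relies on the linear decision function being X @ coef_.T + intercept_.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

scores = clf.decision_function(X)  # proportional to distance
w_norm = np.linalg.norm(clf.coef_)  # norm of the weight vector
distances = scores / w_norm         # signed geometric distances

# Cross-check against the explicit hyperplane formula (w . x + b) / ||w||.
manual = (X @ clf.coef_.ravel() + clf.intercept_[0]) / w_norm
print(distances)
```

Negative distances fall on the side of the first class in classes_, positive ones on the side of the second.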
fit(X, y, sample_weight=None)
Fit the SVM model according to the given training data.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
    Training vectors, where n_samples is the number of samples
    and n_features is the number of features.
    For kernel="precomputed", the expected shape of X is
    (n_samples, n_samples).
y : array-like of shape (n_samples,)
    Target values (class labels in classification, real numbers in
    regression).
sample_weight : array-like of shape (n_samples,), default=None
    Per-sample weights. Rescale C per sample. Higher weights
    force the classifier to put more emphasis on these points.
Returns:
self : object
    Fitted estimator.
Notes
If X and y are not C-ordered and contiguous arrays of np.float64 and
X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse
matrices as input.
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing
mechanism works.
Returns:
routing : MetadataRequest
    A MetadataRequest encapsulating
    routing information.
get_params(deep=True)
Get parameters for this estimator.
Parameters:
deep : bool, default=True
    If True, will return the parameters for this estimator and
    contained subobjects that are estimators.
Returns:
params : dict
    Parameter names mapped to their values.
property n_support_
Number of support vectors for each class.
predict(X)
Perform classification on samples in X.
For a one-class model, +1 or -1 is returned.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples_test, n_samples_train)
    For kernel="precomputed", the expected shape of X is
    (n_samples_test, n_samples_train).
Returns:
y_pred : ndarray of shape (n_samples,)
    Class labels for samples in X.
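The shape convention for kernel="precomputed" can be illustrated by passing a linear Gram matrix by hand. This sketch assumes synthetic data and a plain dot-product kernel; the linear-kernel model is fitted alongside only for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, X_test = X[:40], X[40:]
y_train = y[:40]

# Kernel between training samples: shape (n_samples_train, n_samples_train).
gram_train = X_train @ X_train.T
# Kernel between test and training samples:
# shape (n_samples_test, n_samples_train).
gram_test = X_test @ X_train.T

clf_pre = SVC(kernel="precomputed").fit(gram_train, y_train)
pred_pre = clf_pre.predict(gram_test)

# Reference model that computes the same linear kernel internally.
clf_lin = SVC(kernel="linear").fit(X_train, y_train)
pred_lin = clf_lin.predict(X_test)

agreement = float((pred_pre == pred_lin).mean())
print(gram_train.shape, gram_test.shape, agreement)
```

At prediction time each row of the matrix holds the kernel values between one test sample and all training samples, which is why the second dimension must be n_samples_train.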
predict_log_proba(X)
Compute log probabilities of possible outcomes for samples in X.
The model needs to have probability information computed at training
time: fit with the parameter probability set to True.
Parameters:
X : array-like of shape (n_samples, n_features) or (n_samples_test, n_samples_train)
    For kernel="precomputed", the expected shape of X is
    (n_samples_test, n_samples_train).
Returns:
T : ndarray of shape (n_samples, n_classes)
    Returns the log-probabilities of the sample for each class in
    the model. The columns correspond to the classes in sorted
    order, as they appear in the attribute classes_.
Notes
The probability model is created using cross validation, so
the results can be slightly different than those obtained by
predict. Also, it will produce meaningless results on very small
datasets.
predict_proba(X)
Compute probabilities of possible outcomes for samples in X.
The model needs to have probability information computed at training
time: fit with the parameter probability set to True.
Parameters:
X : array-like of shape (n_samples, n_features)
    For kernel="precomputed", the expected shape of X is
    (n_samples_test, n_samples_train).
Returns:
T : ndarray of shape (n_samples, n_classes)
    Returns the probability of the sample for each class in
    the model. The columns correspond to the classes in sorted
    order, as they appear in the attribute classes_.
Notes
The probability model is created using cross validation, so
the results can be slightly different than those obtained by
predict. Also, it will produce meaningless results on very small
datasets.
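A short sketch of the probability outputs, assuming the iris dataset and a fixed random_state so that the internal cross-validation is reproducible. It shows how columns map back to labels through classes_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True must be set before fit; it enables predict_proba.
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])          # shape (n_samples, n_classes)
log_proba = clf.predict_log_proba(X[:5])  # elementwise log of proba

# Column j corresponds to clf.classes_[j].
proba_labels = clf.classes_[np.argmax(proba, axis=1)]
print(proba.sum(axis=1))
print(proba_labels, clf.predict(X[:5]))
```

The labels derived from the highest probability usually match predict, but as noted above they are not guaranteed to, because the probability model is fitted separately.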
property probA_
Parameter learned in Platt scaling when probability=True.
Returns:
ndarray of shape (n_classes * (n_classes - 1) / 2,)
property probB_
Parameter learned in Platt scaling when probability=True.
Returns:
ndarray of shape (n_classes * (n_classes - 1) / 2,)
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters:
X : array-like of shape (n_samples, n_features)
    Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for X.
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.
Returns:
score : float
    Mean accuracy of self.predict(X) w.r.t. y.
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVC
Request metadata passed to the fit method.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config).
Please see User Guide on how the routing
mechanism works.
The options for each parameter are:
    True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
    False: metadata is not requested and the meta-estimator will not pass it to fit.
    None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
    str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
    Metadata routing for sample_weight parameter in fit.
Returns:
self : object
    The updated object.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it's
possible to update each component of a nested object.
Parameters:
**params : dict
    Estimator parameters.
Returns:
self : estimator instance
    Estimator instance.
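Nested parameters are addressed by joining the component name and the parameter name with a double underscore. The sketch below assumes a pipeline built with make_pipeline, which names the SVC step "svc" automatically; the parameter values are arbitrary.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Update parameters of the nested SVC step through the pipeline.
pipe.set_params(svc__C=10.0, svc__kernel="linear")

# The change is visible both on the step itself and on the pipeline.
svc_params = pipe.named_steps["svc"].get_params()
print(svc_params["C"], svc_params["kernel"])
print(pipe.get_params()["svc__C"])
```

The same naming scheme is what parameter search tools expect as keys when the estimator being tuned sits inside a pipeline.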
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVC[source]¶
Request metadata passed to the score method.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config).
Please see User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for sample_weight parameter in score.
Returns:
self : object
The updated object.
Examples using sklearn.svm.SVC
Release Highlights for scikit-learn 0.24
Release Highlights for scikit-learn 0.22
Classifier comparison
Plot classification probability
Recognizing hand-written digits
Plot the decision boundaries of a VotingClassifier
Faces recognition example using eigenfaces and SVMs
Libsvm GUI
Recursive feature elimination
Scalable learning with polynomial kernel approximation
Displaying Pipelines
Explicit feature map approximation for RBF kernels
Multilabel classification
ROC Curve with Visualization API
Comparison between grid search and successive halving
Confusion matrix
Custom refit strategy of a grid search with cross-validation
Nested versus non-nested cross-validation
Plotting Learning Curves and Checking Models' Scalability
Plotting Validation Curves
Receiver Operating Characteristic (ROC) with cross validation
Statistical comparison of models using grid search
Test with permutations the significance of a classification score
Concatenating multiple feature extraction methods
Feature discretization
Decision boundary of semi-supervised classifiers versus SVM on the Iris dataset
Effect of varying threshold for self-training
Plot classification boundaries with different SVM Kernels
Plot different SVM classifiers in the iris dataset
RBF SVM parameters
SVM Margins Example
SVM Tie Breaking Example
SVM with custom kernel
SVM-Anova: SVM with univariate feature selection
SVM: Maximum margin separating hyperplane
SVM: Separating hyperplane for unbalanced classes
SVM: Weighted samples
SVM Exercise
© 2007 - 2024, scikit-learn developers (BSD License).
sklearn.decomposition.PCA — scikit-learn 1.4.1 documentation
sklearn.decomposition.PCA
class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)
Principal component analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the
data to project it to a lower dimensional space. The input data is centered
but not scaled for each feature before applying the SVD.
It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.
It can also use the scipy.sparse.linalg ARPACK implementation of the
truncated SVD.
Notice that this class does not support sparse input. See
TruncatedSVD for an alternative with sparse data.
For a usage example, see
PCA example with Iris Data-set
Read more in the User Guide.
Parameters:
n_components : int, float or ‘mle’, default=None
Number of components to keep.
If n_components is not set, all components are kept:
n_components == min(n_samples, n_features)
If n_components == 'mle' and svd_solver == 'full', Minka’s
MLE is used to guess the dimension. Use of n_components == 'mle'
will interpret svd_solver == 'auto' as svd_solver == 'full'.
If 0 < n_components < 1 and svd_solver == 'full', select the
number of components such that the amount of variance that needs to be
explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be
strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
copy : bool, default=True
If False, data passed to fit are overwritten and running
fit(X).transform(X) will not yield the expected results;
use fit_transform(X) instead.
whiten : bool, default=False
When True, the components_ vectors are multiplied
by the square root of n_samples and then divided by the singular values
to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal
(the relative variance scales of the components) but can sometimes
improve the predictive accuracy of the downstream estimators by
making their data respect some hard-wired assumptions.
svd_solver : {‘auto’, ‘full’, ‘arpack’, ‘randomized’}, default=’auto’
If ‘auto’: the solver is selected by a default policy based on X.shape and
n_components: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient ‘randomized’
method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.
If ‘full’: run exact full SVD calling the standard LAPACK solver via
scipy.linalg.svd and select the components by postprocessing.
If ‘arpack’: run SVD truncated to n_components calling ARPACK solver via
scipy.sparse.linalg.svds. It requires strictly
0 < n_components < min(X.shape).
If ‘randomized’: run randomized SVD by the method of Halko et al.
New in version 0.18.0.
tol : float, default=0.0
Tolerance for singular values computed by svd_solver == ‘arpack’.
Must be of range [0.0, infinity).
New in version 0.18.0.
iterated_power : int or ‘auto’, default=’auto’
Number of iterations for the power method computed by
svd_solver == ‘randomized’.
Must be of range [0, infinity).
New in version 0.18.0.
n_oversamples : int, default=10
This parameter is only relevant when svd_solver="randomized".
It corresponds to the additional number of random vectors to sample the
range of X so as to ensure proper conditioning. See
randomized_svd for more details.
New in version 1.1.
power_iteration_normalizer : {‘auto’, ‘QR’, ‘LU’, ‘none’}, default=’auto’
Power iteration normalizer for randomized SVD solver.
Not used by ARPACK. See randomized_svd
for more details.
New in version 1.1.
random_state : int, RandomState instance or None, default=None
Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int
for reproducible results across multiple function calls.
See Glossary.
New in version 0.18.0.
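Two of the parameters above are easier to grasp when seen in action: a fractional n_components and whiten. The sketch below reuses the small array from the Examples section of this page; the variable names are illustrative, and the checks are not a substitute for the library's own tests.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# A float in (0, 1) with svd_solver="full" keeps just enough components
# to explain that share of the variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(X)
n_kept = pca.n_components_  # 1 here: the first component explains about 99%

# With whiten=True the transformed columns are uncorrelated with unit variance.
Z = PCA(n_components=2, whiten=True).fit_transform(X)
Z_cov = np.cov(Z, rowvar=False)

print(n_kept)
print(np.round(Z_cov, 6))
```

Whitening discards the relative scale of the components, so it is a choice to make with the downstream estimator in mind.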
Attributes:
components_ : ndarray of shape (n_components, n_features)
Principal axes in feature space, representing the directions of
maximum variance in the data. Equivalently, the right singular
vectors of the centered input data, parallel to its eigenvectors.
The components are sorted by decreasing explained_variance_.
explained_variance_ : ndarray of shape (n_components,)
The amount of variance explained by each of the selected components.
The variance estimation uses n_samples - 1 degrees of freedom.
Equal to n_components largest eigenvalues
of the covariance matrix of X.
New in version 0.18.
explained_variance_ratio_ : ndarray of shape (n_components,)
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the
sum of the ratios is equal to 1.0.
singular_values_ : ndarray of shape (n_components,)
The singular values corresponding to each of the selected components.
The singular values are equal to the 2-norms of the n_components
variables in the lower-dimensional space.
New in version 0.19.
mean_ : ndarray of shape (n_features,)
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
n_components_ : int
The estimated number of components. When n_components is set
to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this
number is estimated from input data. Otherwise it equals the parameter
n_components, or the lesser value of n_features and n_samples
if n_components is None.
n_samples_ : int
Number of samples in the training data.
noise_variance_ : float
The estimated noise covariance following the Probabilistic PCA model
from Tipping and Bishop 1999. See “Pattern Recognition and
Machine Learning” by C. Bishop, 12.2.1 p. 574 or
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components)
smallest eigenvalues of the covariance matrix of X.
n_features_in_ : int
Number of features seen during fit.
New in version 0.24.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X
has feature names that are all strings.
New in version 1.0.
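The relationships between these attributes can be reproduced with a plain SVD. This is a NumPy-only sketch of the definitions above, not the library's implementation; the local variable names mirror the fitted attributes for readability only.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
n_samples = X.shape[0]

# Center the data (not scaled), then take the thin SVD.
mean = X.mean(axis=0)                      # counterpart of mean_
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

components = Vt                            # rows are principal axes, up to sign
singular_values = S                        # counterpart of singular_values_
explained_variance = S**2 / (n_samples - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(singular_values)
print(explained_variance_ratio)
```

On this array the printed values agree with the singular values and variance ratios shown in the Examples section of this page.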
See also
KernelPCA : Kernel Principal Component Analysis.
SparsePCA : Sparse Principal Component Analysis.
TruncatedSVD : Dimensionality reduction using truncated SVD.
IncrementalPCA : Incremental Principal Component Analysis.
References
For n_components == ‘mle’, this class uses the method from:
Minka, T. P. “Automatic choice of dimensionality for PCA”.
In NIPS, pp. 598-604
Implements the probabilistic PCA model from:
Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic principal
component analysis”. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(3), 611-622.
via the score and score_samples methods.
For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.
For svd_solver == ‘randomized’, see:
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
“Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions”.
SIAM review, 53(2), 217-288.
and also
Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
“A randomized algorithm for the decomposition of matrices”.
Applied and Computational Harmonic Analysis, 30(1), 47-68.
Examples
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(n_components=2)
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.0075...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]
>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X)
PCA(n_components=2, svd_solver='full')
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.00755...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]
>>> pca = PCA(n_components=1, svd_solver='arpack')
>>> pca.fit(X)
PCA(n_components=1, svd_solver='arpack')
>>> print(pca.explained_variance_ratio_)
[0.99244...]
>>> print(pca.singular_values_)
[6.30061...]
Methods
fit(X[, y]) : Fit the model with X.
fit_transform(X[, y]) : Fit the model with X and apply the dimensionality reduction on X.
get_covariance() : Compute data covariance with the generative model.
get_feature_names_out([input_features]) : Get output feature names for transformation.
get_metadata_routing() : Get metadata routing of this object.
get_params([deep]) : Get parameters for this estimator.
get_precision() : Compute data precision matrix with the generative model.
inverse_transform(X) : Transform data back to its original space.
score(X[, y]) : Return the average log-likelihood of all samples.
score_samples(X) : Return the log-likelihood of each sample.
set_output(*[, transform]) : Set output container.
set_params(**params) : Set the parameters of this estimator.
transform(X) : Apply dimensionality reduction to X.
fit(X, y=None)
Fit the model with X.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training data, where n_samples is the number of samples
and n_features is the number of features.
y : Ignored
Ignored.
Returns:
self : object
Returns the instance itself.
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training data, where n_samples is the number of samples
and n_features is the number of features.
y : Ignored
Ignored.
Returns:
X_new : ndarray of shape (n_samples, n_components)
Transformed values.
Notes
This method returns a Fortran-ordered array. To convert it to a
C-ordered array, use ‘np.ascontiguousarray’.
get_covariance()
Compute data covariance with the generative model.
cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)
where S**2 contains the explained variances, and sigma2 contains the
noise variances.
Returns:
cov : array of shape (n_features, n_features)
Estimated covariance of data.
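When every component is kept, no variance is left over for the noise term, so the model covariance should coincide with the empirical covariance of the data. The sketch below checks this on synthetic data; the mixing matrix and variable names are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing     # correlated synthetic features

pca = PCA().fit(X)                          # n_components=None keeps all 3
model_cov = pca.get_covariance()
# np.cov divides by n_samples - 1, like explained_variance_.
empirical_cov = np.cov(X, rowvar=False)

max_gap = np.abs(model_cov - empirical_cov).max()
print(max_gap)
```

With fewer components than features, the two matrices generally differ, since the discarded directions are summarized by a single noise variance.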
get_feature_names_out(input_features=None)
Get output feature names for transformation.
The feature names out will be prefixed by the lowercased class name. For
example, if the transformer outputs 3 features, then the feature names
out are: ["class_name0", "class_name1", "class_name2"].
Parameters:
input_features : array-like of str or None, default=None
Only used to validate feature names with the names seen in fit.
Returns:
feature_names_out : ndarray of str objects
Transformed feature names.
get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing
mechanism works.
Returns:
routing : MetadataRequest
A MetadataRequest encapsulating
routing information.
get_params(deep=True)
Get parameters for this estimator.
Parameters:
deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns:
params : dict
Parameter names mapped to their values.
get_precision()
Compute data precision matrix with the generative model.
Equals the inverse of the covariance but computed with
the matrix inversion lemma for efficiency.
Returns:
precision : array of shape (n_features, n_features)
Estimated precision of data.
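Since the precision is the inverse of the model covariance, their product should be the identity up to rounding. A quick numerical check on random data (the shapes and seed are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))

# Fewer components than features, so the noise variance is non-zero.
pca = PCA(n_components=2).fit(X)
product = pca.get_precision() @ pca.get_covariance()

print(np.round(product, 6))
```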
inverse_transform(X)
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
Parameters:
X : array-like of shape (n_samples, n_components)
New data, where n_samples is the number of samples
and n_components is the number of components.
Returns:
X_original : array-like of shape (n_samples, n_features)
Original data, where n_samples is the number of samples
and n_features is the number of features.
Notes
If whitening is enabled, inverse_transform will compute the
exact inverse operation, which includes reversing whitening.
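A round trip through transform and inverse_transform shows both cases: with all components kept the data are recovered (whitening included), while with fewer components the result is only an approximation. The data and names below are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))

# All components kept, with whitening: the round trip recovers X.
full = PCA(n_components=3, whiten=True).fit(X)
X_back = full.inverse_transform(full.transform(X))
exact_error = np.abs(X_back - X).max()

# One component kept: the round trip is a rank-1 approximation of X.
reduced = PCA(n_components=1).fit(X)
X_approx = reduced.inverse_transform(reduced.transform(X))
approx_error = np.abs(X_approx - X).max()

print(exact_error, approx_error)
```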
score(X, y=None)
Return the average log-likelihood of all samples.
See “Pattern Recognition and Machine Learning”
by C. Bishop, 12.2.1 p. 574
or http://www.miketipping.com/papers/met-mppca.pdf
Parameters:
X : array-like of shape (n_samples, n_features)
The data.
y : Ignored
Ignored.
Returns:
ll : float
Average log-likelihood of the samples under the current model.
score_samples(X)
Return the log-likelihood of each sample.
See “Pattern Recognition and Machine Learning”
by C. Bishop, 12.2.1 p. 574
or http://www.miketipping.com/papers/met-mppca.pdf
Parameters:
X : array-like of shape (n_samples, n_features)
The data.
Returns:
ll : ndarray of shape (n_samples,)
Log-likelihood of each sample under the current model.
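The two scoring methods are related in the obvious way: score is the mean of the per-sample values from score_samples. A brief check on random data, with arbitrary shapes and seed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=1).fit(X)
per_sample = pca.score_samples(X)   # one log-likelihood per row of X
average = pca.score(X)              # a single float

print(per_sample.shape, average)
```

Comparing the average score across different n_components values on held-out data is one way such a likelihood can be used for model selection.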
set_output(*, transform=None)
Set output container.
See Introducing the set_output API
for an example on how to use the API.
Parameters:
transform : {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
New in version 1.4: "polars" option was added.
Returns:
self : estimator instance
Estimator instance.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it's
possible to update each component of a nested object.
Parameters:
**params : dict
Estimator parameters.
Returns:
self : estimator instance
Estimator instance.
transform(X)
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted
from a training set.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
New data, where n_samples is the number of samples
and n_features is the number of features.
Returns:
X_new : array-like of shape (n_samples, n_components)
Projection of X in the first principal components, where n_samples
is the number of samples and n_components is the number of the components.
Examples using sklearn.decomposition.PCA
Release Highlights for scikit-learn 1.4
A demo of K-Means clustering on the handwritten digits data
Principal Component Regression vs Partial Least Squares Regression
The Iris Dataset
Blind source separation using FastICA
Comparison of LDA and PCA 2D projection of Iris dataset
Faces dataset decompositions
Factor Analysis (with rotation) to visualize patterns
FastICA on 2D point clouds
Incremental PCA
Kernel PCA
Model selection with Probabilistic PCA and Factor Analysis (FA)
PCA example with Iris Data-set
Faces recognition example using eigenfaces and SVMs
Image denoising using kernel PCA
Multi-dimensional scaling
Displaying Pipelines
Explicit feature map approximation for RBF kernels
Multilabel classification
Balance model complexity and cross-validated score
Dimensionality Reduction with Neighborhood Components Analysis
Kernel Density Estimation
Column Transformer with Heterogeneous Data Sources
Concatenating multiple feature extraction methods
Pipelining: chaining a PCA and a logistic regression
Selecting dimensionality reduction with Pipeline and GridSearchCV
Importance of Feature Scaling
scikit-learn Chinese community (scikit-learn中文社区)
scikit-learn
Machine Learning in Python
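Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
Classification
Identifying which category an object belongs to.
Applications: Spam detection, image recognition.
Algorithms:
SVM
nearest neighbors
random forest
and more...
Examples
Regression
Predicting a continuous-valued attribute associated with an object.
Applications: Drug response, stock prices.
Algorithms:
SVR
nearest neighbors
random forest
and more...
Examples
Clustering
Automatic grouping of similar objects into sets.
Applications: Customer segmentation, grouping experiment outcomes.
Algorithms:
k-Means
spectral clustering
MeanShift
and more...
Examples
Dimensionality reduction
Reducing the number of random variables to consider.
Applications: Visualization, increased efficiency.
Algorithms:
k-Means
feature selection
non-negative matrix factorization
and more...
Examples
Model selection
Comparing, validating and choosing parameters and models.
Applications: Improved accuracy via parameter tuning.
Algorithms:
grid search
cross validation
metrics
and more...
Examples
Preprocessing
Feature extraction and normalization.
Applications: Transforming input data such as text for use with machine learning algorithms.
Algorithms:
preprocessing
feature extraction
and more...
Examples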
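News
Version under development:
Changelog
August 2020. scikit-learn 0.23.2 is available for download (Changelog).
May 2020. scikit-learn 0.23.1 is available for download (Changelog).
May 2020. scikit-learn 0.23.0 is available for download (Changelog).
Scikit-learn from 0.23 requires Python 3.6 or later.
March 2020. scikit-learn 0.22.2 is available for download (Changelog).
January 2020. scikit-learn 0.22.1 is available for download (Changelog).
December 2019. scikit-learn 0.22 is available for download (Changelog and Release Highlights).
Scikit-learn from 0.21 requires Python 3.5 or later.
July 2019. scikit-learn 0.21.3 (Changelog) and 0.20.4 (Changelog) are available for download.
May 2019. scikit-learn 0.21.0 to 0.21.2 are available for download (Changelog).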
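About
About us: CDA数据科学研究院 (CDA Data Science Institute)
Sponsor: CDA考试认证中心 (CDA Examination and Certification Center)
Customer service phone: +86 4000-51-9191
Email: service@cda.cn
Follow us
"The scikit-learn Chinese documentation is translated by CDA数据科学研究院; scan the QR code to follow and get more information."
Copyright © 2015-2020, CDA数据科学研究院. All rights reserved. 京ICP备11001960号-13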
sklearn.cluster.KMeans — scikit-learn 1.4.1 documentation
sklearn.cluster.KMeans
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
K-Means clustering.
Read more in the User Guide.
Parameters:
n_clusters : int, default=8
The number of clusters to form as well as the number of
centroids to generate.
For an example of how to choose an optimal value for n_clusters refer to
Selecting the number of clusters with silhouette analysis on KMeans clustering.
init : {‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’
Method for initialization:
‘k-means++’: selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.
‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.
If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
For an example of how to use the different init strategies, see the example
entitled A demo of K-Means clustering on the handwritten digits data.
n_init : ‘auto’ or int, default=’auto’
Number of times the k-means algorithm is run with different centroid
seeds. The final result is the best output of n_init consecutive runs
in terms of inertia. Several runs are recommended for sparse
high-dimensional problems (see Clustering sparse data with k-means).
When n_init='auto', the number of runs depends on the value of init:
10 if using init='random' or init is a callable;
1 if using init='k-means++' or init is an array-like.
New in version 1.2: Added ‘auto’ option for n_init.
Changed in version 1.4: Default value for n_init changed to 'auto'.
max_iter : int, default=300
Maximum number of iterations of the k-means algorithm for a
single run.
tol : float, default=1e-4
Relative tolerance with regard to the Frobenius norm of the difference
in the cluster centers of two consecutive iterations to declare
convergence.
verbose : int, default=0
Verbosity mode.
random_state : int, RandomState instance or None, default=None
Determines random number generation for centroid initialization. Use
an int to make the randomness deterministic.
See Glossary.
copy_x : bool, default=True
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True (default), then the original data is
not modified. If False, the original data is modified, and put back
before the function returns, but small numerical differences may be
introduced by subtracting and then adding the data mean. Note that if
the original data is not C-contiguous, a copy will be made even if
copy_x is False. If the original data is sparse, but not in CSR format,
a copy will be made even if copy_x is False.
algorithm : {“lloyd”, “elkan”}, default=”lloyd”
K-means algorithm to use. The classical EM-style algorithm is "lloyd".
The "elkan" variation can be more efficient on some datasets with
well-defined clusters, by using the triangle inequality. However it’s
more memory intensive due to the allocation of an extra array of shape
(n_samples, n_clusters).
Changed in version 0.18: Added Elkan algorithm.
Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”.
Changed “auto” to use “lloyd” instead of “elkan”.
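Passing an array as init fixes the starting centers, which makes a run fully deterministic. The sketch below uses the small dataset from the Examples section of this page; the starting centers are arbitrary values picked for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# One row per cluster; with explicit centers a single run is enough.
init_centers = np.array([[0.0, 0.0], [12.0, 3.0]])
kmeans = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)

labels = kmeans.labels_.tolist()
centers = kmeans.cluster_centers_
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # rows [1, 2] and [10, 2]
```

Cluster indices follow the order of the rows in the init array, which is why the left-hand group is labelled 0 here.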
Attributes:
cluster_centers_ : ndarray of shape (n_clusters, n_features)
Coordinates of cluster centers. If the algorithm stops before fully
converging (see tol and max_iter), these will not be
consistent with labels_.
labels_ : ndarray of shape (n_samples,)
Labels of each point.
inertia_ : float
Sum of squared distances of samples to their closest cluster center,
weighted by the sample weights if provided.
n_iter_ : int
Number of iterations run.
n_features_in_ : int
Number of features seen during fit.
New in version 0.24.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X
has feature names that are all strings.
New in version 1.0.
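The definition of inertia_ can be verified by hand from labels_ and cluster_centers_. This sketch assumes unweighted samples and, for reproducibility, explicit starting centers chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
init_centers = np.array([[0.0, 0.0], [12.0, 3.0]])
kmeans = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)

# Center assigned to each sample, then the sum of squared distances.
assigned = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned) ** 2).sum()

print(manual_inertia, kmeans.inertia_)  # both 16.0 on this data
```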
See also
MiniBatchKMeans : Alternative online implementation that does incremental updates of the centers' positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.
Notes
The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.
The average complexity is given by O(k n T), where k is the number of
clusters, n is the number of samples and T is the number of iterations.
The worst case complexity is given by O(n^(k+2/p)) with
n = n_samples, p = n_features.
Refer to “How slow is the k-means method?” by D. Arthur and S. Vassilvitskii
(SoCG 2006) for more details.
In practice, the k-means algorithm is very fast (one of the fastest
clustering algorithms available), but it can get stuck in local minima. That’s why
it can be useful to restart it several times.
If the algorithm stops before fully converging (because of tol or
max_iter), labels_ and cluster_centers_ will not be consistent,
i.e. the cluster_centers_ will not be the means of the points in each
cluster. Also, the estimator will reassign labels_ after the last
iteration to make labels_ consistent with predict on the training
set.
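The alternation that Lloyd's algorithm performs can be sketched in a few lines of NumPy. This is a simplified illustration, not the library's implementation: the helper names are hypothetical, the stopping rule is an absolute threshold on the squared center shift rather than the relative tolerance described above, and empty clusters simply keep their previous center.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X."""
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

def lloyd(X, init_centers, max_iter=300, tol=1e-4):
    """Plain Lloyd iterations; returns (centers, labels, n_iter)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        labels = nearest_center(X, centers)          # assignment step
        new_centers = centers.copy()
        for k in range(len(centers)):                # update step
            members = X[labels == k]
            if len(members):                         # skip empty clusters
                new_centers[k] = members.mean(axis=0)
        shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        if shift <= tol:
            break
    # Final assignment so that labels agree with the returned centers.
    return centers, nearest_center(X, centers), n_iter

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
centers, labels, n_iter = lloyd(X, [[0.0, 0.0], [12.0, 3.0]])
print(centers, labels, n_iter)
```

Because the result depends on the starting centers, running such a loop from several initializations and keeping the lowest-inertia result is the usual remedy for poor local minima.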
Examples
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
For a more detailed example of K-Means using the iris dataset see
K-means Clustering.
For examples of common problems with K-Means and how to address them see
Demonstration of k-means assumptions.
For an example of how to use K-Means to perform color quantization see
Color Quantization using K-Means.
For a demonstration of how K-Means can be used to cluster text documents see
Clustering text documents using k-means.
For a comparison between K-Means and MiniBatchKMeans refer to example
Comparison of the K-Means and MiniBatchKMeans clustering algorithms.
Methods
fit(X[, y, sample_weight]) : Compute k-means clustering.
fit_predict(X[, y, sample_weight]) : Compute cluster centers and predict cluster index for each sample.
fit_transform(X[, y, sample_weight]) : Compute clustering and transform X to cluster-distance space.
get_feature_names_out([input_features]) : Get output feature names for transformation.
get_metadata_routing() : Get metadata routing of this object.
get_params([deep]) : Get parameters for this estimator.
predict(X[, sample_weight]) : Predict the closest cluster each sample in X belongs to.
score(X[, y, sample_weight]) : Opposite of the value of X on the K-means objective.
set_fit_request(*[, sample_weight]) : Request metadata passed to the fit method.
set_output(*[, transform]) : Set output container.
set_params(**params) : Set the parameters of this estimator.
set_predict_request(*[, sample_weight]) : Request metadata passed to the predict method.
set_score_request(*[, sample_weight]) : Request metadata passed to the score method.
transform(X) : Transform X to a cluster-distance space.
fit(X, y=None, sample_weight=None)
Compute k-means clustering.
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. It must be noted that the data
will be converted to C ordering, which will cause a memory
copy if the given data is not C-contiguous.
If a sparse matrix is passed, a copy will be made if it’s not in
CSR format.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations
are assigned equal weight. sample_weight is not used during
initialization if init is a callable or a user provided array.
New in version 0.20.
Returns:
self : object
Fitted estimator.
fit_predict(X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by
predict(X).
Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
New data to transform.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations
are assigned equal weight.
Returns:
labels : ndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
fit_transform(X, y=None, sample_weight=None)[source]¶
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
Parameters:
X{array-like, sparse matrix} of shape (n_samples, n_features)New data to transform.
yIgnoredNot used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=NoneThe weights for each observation in X. If None, all observations
are assigned equal weight.
Returns:
X_newndarray of shape (n_samples, n_clusters)X transformed in the new space.
get_feature_names_out(input_features=None)[source]¶
Get output feature names for transformation.
The output feature names will be prefixed by the lowercased class name. For
example, if the transformer outputs 3 features, then the output feature names
are: ["class_name0", "class_name1", "class_name2"].
Parameters:
input_featuresarray-like of str or None, default=NoneOnly used to validate feature names with the names seen in fit.
Returns:
feature_names_outndarray of str objectsTransformed feature names.
get_metadata_routing()[source]¶
Get metadata routing of this object.
Please check User Guide on how the routing
mechanism works.
Returns:
routingMetadataRequestA MetadataRequest encapsulating
routing information.
get_params(deep=True)[source]¶
Get parameters for this estimator.
Parameters:
deepbool, default=TrueIf True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns:
paramsdictParameter names mapped to their values.
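As an illustration of the deep flag, the sketch below reads parameters from a plain estimator and from a Pipeline. The step names "scale" and "kmeans" are arbitrary choices made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

km = KMeans(n_clusters=4, random_state=0)
params = km.get_params()

# The returned dict can be used to build an identically configured estimator.
km_copy = KMeans(**params)

# For a nested object, deep=True also lists the parameters of sub-estimators.
pipe = Pipeline([("scale", StandardScaler()), ("kmeans", km)])
shallow = pipe.get_params(deep=False)
deep = pipe.get_params(deep=True)
print(sorted(shallow))
print("kmeans__n_clusters" in deep)
```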
predict(X, sample_weight='deprecated')[source]¶
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called
the code book and each value returned by predict is the index of
the closest code in the code book.
Parameters:
X{array-like, sparse matrix} of shape (n_samples, n_features)New data to predict.
sample_weightarray-like of shape (n_samples,), default=NoneThe weights for each observation in X. If None, all observations
are assigned equal weight.
Deprecated since version 1.3: The parameter sample_weight is deprecated in version 1.3
and will be removed in 1.5.
Returns:
labelsndarray of shape (n_samples,)Index of the cluster each sample belongs to.
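The code book reading of predict can be sketched as follows: each returned index selects a row of cluster_centers_, which gives the quantized version of the sample. The random data and the number of clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_train = rng.rand(200, 3)
X_new = rng.rand(5, 3)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_train)

# Index of the closest code for every new sample.
codes = km.predict(X_new)

# Code book lookup: replace each sample by its closest cluster center.
quantized = km.cluster_centers_[codes]

# Brute-force check of "closest" using explicit Euclidean distances.
dists = np.linalg.norm(
    X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
print(codes, dists.argmin(axis=1))
```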
score(X, y=None, sample_weight=None)[source]¶
Opposite of the value of X on the K-means objective.
Parameters:
X{array-like, sparse matrix} of shape (n_samples, n_features)New data.
yIgnoredNot used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=NoneThe weights for each observation in X. If None, all observations
are assigned equal weight.
Returns:
scorefloatOpposite of the value of X on the K-means objective.
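To make the sign convention concrete, the sketch below compares score with a manual computation of the negated sum of squared distances to the closest center. The data is random and purely illustrative. On the training data the value should also be close to -inertia_, though that comparison is only printed here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 2)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
train_score = km.score(X)

# K-means objective: sum of squared distances to the assigned center.
assigned_centers = km.cluster_centers_[km.predict(X)]
manual = -np.sum((X - assigned_centers) ** 2)

print(train_score, manual, -km.inertia_)
```

Because the score is the negated objective, higher (closer to zero) is better, which matches the scikit-learn convention for score methods.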
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans[source]¶
Request metadata passed to the fit method.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config).
Please see User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGEDMetadata routing for sample_weight parameter in fit.
Returns:
selfobjectThe updated object.
set_output(*, transform=None)[source]¶
Set output container.
See Introducing the set_output API
for an example on how to use the API.
Parameters:
transform{“default”, “pandas”, “polars”}, default=NoneConfigure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
New in version 1.4: "polars" option was added.
Returns:
selfestimator instanceEstimator instance.
set_params(**params)[source]¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Parameters:
**paramsdictEstimator parameters.
Returns:
selfestimator instanceEstimator instance.
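A short sketch of updating nested parameters. Inside a Pipeline, a parameter of a step is addressed as the step name, a double underscore, then the parameter name. The step names "scale" and "kmeans" are arbitrary choices for this example.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=8)),
])

# Update one parameter on each step in a single call.
pipe.set_params(kmeans__n_clusters=3, scale__with_mean=False)

print(pipe.named_steps["kmeans"].n_clusters)
print(pipe.named_steps["scale"].with_mean)
```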
set_predict_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans[source]¶
Request metadata passed to the predict method.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config).
Please see User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGEDMetadata routing for sample_weight parameter in predict.
Returns:
selfobjectThe updated object.
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans[source]¶
Request metadata passed to the score method.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config).
Please see User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGEDMetadata routing for sample_weight parameter in score.
Returns:
selfobjectThe updated object.
transform(X)[source]¶
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster
centers. Note that even if X is sparse, the array returned by
transform will typically be dense.
Parameters:
X{array-like, sparse matrix} of shape (n_samples, n_features)New data to transform.
Returns:
X_newndarray of shape (n_samples, n_clusters)X transformed in the new space.
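The cluster-distance space can be reproduced by hand. The sketch below compares transform with explicit Euclidean distances to cluster_centers_. The random data is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 4)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One column per cluster: distance from each sample to that center.
X_new = km.transform(X)

# The same distances computed directly with NumPy.
manual = np.linalg.norm(
    X[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
print(X_new.shape)
```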
Examples using sklearn.cluster.KMeans¶
Release Highlights for scikit-learn 1.1
Release Highlights for scikit-learn 0.23
A demo of K-Means clustering on the handwritten digits data
Bisecting K-Means and Regular K-Means Performance Comparison
Color Quantization using K-Means
Comparison of the K-Means and MiniBatchKMeans clustering algorithms
Demonstration of k-means assumptions
Empirical evaluation of the impact of k-means initialization
K-means Clustering
Selecting the number of clusters with silhouette analysis on KMeans clustering
Clustering text documents using k-means
© 2007 - 2024, scikit-learn developers (BSD License).
sklearn.model_selection.GridSearchCV — scikit-learn 1.4.1 documentation
sklearn.model_selection.GridSearchCV¶
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)[source]¶
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a “fit” and a “score” method.
It also implements “score_samples”, “predict”, “predict_proba”,
“decision_function”, “transform” and “inverse_transform” if they are
implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized
by cross-validated grid-search over a parameter grid.
Read more in the User Guide.
Parameters:
estimatorestimator objectThis is assumed to implement the scikit-learn estimator interface.
Either estimator needs to provide a score function,
or scoring must be passed.
param_griddict or list of dictionariesDictionary with parameters names (str) as keys and lists of
parameter settings to try as values, or a list of such
dictionaries, in which case the grids spanned by each dictionary
in the list are explored. This enables searching over any sequence
of parameter settings.
scoringstr, callable, list, tuple or dict, default=NoneStrategy to evaluate the performance of the cross-validated model on
the test set.
If scoring represents a single score, one can use:
a single string (see The scoring parameter: defining model evaluation rules);
a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric
names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
See Specifying multiple metrics for evaluation for an example.
n_jobsint, default=NoneNumber of jobs to run in parallel.
None means 1 unless in a joblib.parallel_backend context.
-1 means using all processors. See Glossary
for more details.
Changed in version v0.20: n_jobs default changed from 1 to None
refitbool, str, or callable, default=TrueRefit an estimator using the best found parameters on the whole
dataset.
For multiple metric evaluation, this needs to be a str denoting the
scorer that would be used to find the best parameters for refitting
the estimator at the end.
Where there are considerations other than maximum score in
choosing a best estimator, refit can be set to a function which
returns the selected best_index_ given cv_results_. In that
case, the best_estimator_ and best_params_ will be set
according to the returned best_index_ while the best_score_
attribute will not be available.
The refitted estimator is made available at the best_estimator_
attribute and permits using predict directly on this
GridSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_,
best_score_ and best_params_ will only be available if
refit is set and all of them will be determined w.r.t this specific
scorer.
See scoring parameter to know more about multiple metric
evaluation.
See Custom refit strategy of a grid search with cross-validation
to see how to design a custom selection strategy using a callable
via refit.
Changed in version 0.20: Support for callable added.
cvint, cross-validation generator or an iterable, default=NoneDetermines the cross-validation splitting strategy.
Possible inputs for cv are:
None, to use the default 5-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is
either binary or multiclass, StratifiedKFold is used. In all
other cases, KFold is used. These splitters are instantiated
with shuffle=False so the splits will be the same across calls.
Refer to the User Guide for the various
cross-validation strategies that can be used here.
Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
verboseintControls the verbosity: the higher, the more messages.
>1 : the computation time for each fold and parameter candidate is
displayed;
>2 : the score is also displayed;
>3 : the fold and candidate parameter indexes are also displayed
together with the starting time of the computation.
pre_dispatchint, or str, default=’2*n_jobs’Controls the number of jobs that get dispatched during parallel
execution. Reducing this number can be useful to avoid an
explosion of memory consumption when more jobs get dispatched
than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately
created and spawned. Use this for lightweight and
fast-running jobs, to avoid delays due to on-demand
spawning of the jobs
An int, giving the exact number of total jobs that are
spawned
A str, giving an expression as a function of n_jobs,
as in ‘2*n_jobs’
error_score‘raise’ or numeric, default=np.nanValue to assign to the score if an error occurs in estimator fitting.
If set to ‘raise’, the error is raised. If a numeric value is given,
FitFailedWarning is raised. This parameter does not affect the refit
step, which will always raise the error.
return_train_scorebool, default=FalseIf False, the cv_results_ attribute will not include training
scores.
Computing training scores is used to get insights on how different
parameter settings impact the overfitting/underfitting trade-off.
However computing the scores on the training set can be computationally
expensive and is not strictly required to select the parameters that
yield the best generalization performance.
New in version 0.19.
Changed in version 0.21: Default value was changed from True to False
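As an illustration of the param_grid and return_train_score parameters, the sketch below passes a list of two dictionaries so that gamma is only searched for the RBF kernel. The dataset, the estimator and the grid values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two separate grids: 2 linear candidates plus 2 * 2 RBF candidates.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3, return_train_score=True)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])

# gamma is masked for the candidates that do not define it.
n_masked_gamma = int(
    np.ma.getmaskarray(search.cv_results_["param_gamma"]).sum()
)
print(n_candidates, n_masked_gamma)
```

Using a list of dictionaries avoids fitting combinations that make no sense, such as a gamma value for a linear kernel.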
Attributes:
cv_results_dict of numpy (masked) ndarraysA dict with keys as column headers and values as columns, that can be
imported into a pandas DataFrame.
For instance the below given table
param_kernel | param_gamma | param_degree | split0_test_score | … | rank_t…
‘poly’ | – | 2 | 0.80 | … | 2
‘poly’ | – | 3 | 0.70 | … | 4
‘rbf’ | 0.1 | – | 0.80 | … | 3
‘rbf’ | 0.2 | – | 0.93 | … | 1
will be represented by a cv_results_ dict of:
{
'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                             mask = [False False False False]...),
'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                            mask = [ True True False False]...),
'param_degree': masked_array(data = [2.0 3.0 -- --],
                             mask = [False False True True]...),
'split0_test_score' : [0.80, 0.70, 0.80, 0.93],
'split1_test_score' : [0.82, 0.50, 0.70, 0.78],
'mean_test_score' : [0.81, 0.60, 0.75, 0.85],
'std_test_score' : [0.01, 0.10, 0.05, 0.08],
'rank_test_score' : [2, 4, 3, 1],
'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
'mean_train_score' : [0.81, 0.74, 0.70, 0.90],
'std_train_score' : [0.01, 0.19, 0.00, 0.03],
'mean_fit_time' : [0.73, 0.63, 0.43, 0.49],
'std_fit_time' : [0.01, 0.02, 0.01, 0.01],
'mean_score_time' : [0.01, 0.06, 0.04, 0.04],
'std_score_time' : [0.00, 0.00, 0.00, 0.01],
'params' : [{'kernel': 'poly', 'degree': 2}, ...],
}
NOTE
The key 'params' is used to store a list of parameter
settings dicts for all the parameter candidates.
The mean_fit_time, std_fit_time, mean_score_time and
std_score_time are all in seconds.
For multi-metric evaluation, the scores for all the scorers are
available in the cv_results_ dict at the keys ending with that
scorer’s name ('_<scorer_name>') instead of '_score' shown
above. (‘split0_test_precision’, ‘mean_train_precision’ etc.)
best_estimator_estimatorEstimator that was chosen by the search, i.e. estimator
which gave highest score (or smallest loss if specified)
on the left out data. Not available if refit=False.
See refit parameter for more information on allowed values.
best_score_floatMean cross-validated score of the best_estimator.
For multi-metric evaluation, this is present only if refit is
specified.
This attribute is not available if refit is a function.
best_params_dictParameter setting that gave the best results on the hold out data.
For multi-metric evaluation, this is present only if refit is
specified.
best_index_intThe index (of the cv_results_ arrays) which corresponds to the best
candidate parameter setting.
The dict at search.cv_results_['params'][search.best_index_] gives
the parameter setting for the best model, that gives the highest
mean score (search.best_score_).
For multi-metric evaluation, this is present only if refit is
specified.
scorer_function or a dictScorer function used on the held out data to choose the best
parameters for the model.
For multi-metric evaluation, this attribute holds the validated
scoring dict which maps the scorer key to the scorer callable.
n_splits_intThe number of cross-validation splits (folds/iterations).
refit_time_floatSeconds used for refitting the best model on the whole dataset.
This is present only if refit is not False.
New in version 0.20.
multimetric_boolWhether or not the scorers compute several metrics.
classes_ndarray of shape (n_classes,)Class labels.
n_features_in_intNumber of features seen during fit.
feature_names_in_ndarray of shape (n_features_in_,)Names of features seen during fit. Only defined if
best_estimator_ is defined (see the documentation for the refit
parameter for more details) and that best_estimator_ exposes
feature_names_in_ when fit.
New in version 1.0.
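The attributes above can be explored with a small multi-metric search. This is a sketch under assumptions: the dataset, the grid and the two metric names are picked for illustration, and pandas is assumed to be installed.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10]},
    scoring=["accuracy", "f1_macro"],
    refit="accuracy",  # scorer used to pick best_index_ and to refit
    cv=3,
)
search.fit(X, y)

# cv_results_ loads directly into a DataFrame, one row per candidate.
results = pd.DataFrame(search.cv_results_)
best_row = results.loc[search.best_index_]

print(results[["params", "mean_test_accuracy", "mean_test_f1_macro"]])
print(search.best_params_, search.best_score_)
```

With several scorers, the result keys end with the scorer name (for example mean_test_accuracy) rather than with _score.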
See also
ParameterGridGenerates all the combinations of a hyperparameter grid.
train_test_splitUtility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
sklearn.metrics.make_scorerMake a scorer from a performance metric or loss function.
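ParameterGrid can be used on its own to see which candidates a grid search would evaluate. A minimal sketch with an illustrative grid:

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"kernel": ["linear", "rbf"], "C": [1, 10]})

# Each element is one dict of parameter settings.
candidates = list(grid)
for candidate in candidates:
    print(candidate)
```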
Notes
The parameters selected are those that maximize the score of the left out
data, unless an explicit score is passed in which case it is used instead.
If n_jobs was set to a value higher than one, the data is copied for each
point in the grid (and not n_jobs times). This is done for efficiency
reasons if individual jobs take very little time, but may raise errors if
the dataset is large and not enough memory is available. A workaround in
this case is to set pre_dispatch. Then, the memory is copied only
pre_dispatch many times. A reasonable value for pre_dispatch is 2 *
n_jobs.
Examples
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(estimator=SVC(),
param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})
>>> sorted(clf.cv_results_.keys())
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
'param_C', 'param_kernel', 'params',...
'rank_test_score', 'split0_test_score',...
'split2_test_score', ...
'std_fit_time', 'std_score_time', 'std_test_score']
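The refit parameter also accepts a callable that receives cv_results_ and returns the chosen index. The sketch below uses a hypothetical selection rule written for this example (smallest C whose mean score is within one standard deviation of the best). It is not a rule provided by scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def smallest_c_within_one_std(cv_results):
    """Hypothetical rule: the smallest C scoring within one std of the best."""
    means = np.asarray(cv_results["mean_test_score"])
    stds = np.asarray(cv_results["std_test_score"])
    best = means.argmax()
    eligible = np.flatnonzero(means >= means[best] - stds[best])
    c_values = np.array([p["C"] for p in cv_results["params"]], dtype=float)
    # Return a plain int index into the cv_results_ arrays.
    return int(eligible[c_values[eligible].argmin()])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10, 100]}, refit=smallest_c_within_one_std, cv=3
)
search.fit(X, y)

print(search.best_index_, search.best_params_)
print(hasattr(search, "best_score_"))
```

As described for the refit parameter, best_score_ is not set when refit is a callable, while best_index_, best_params_ and best_estimator_ are.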
Methods
decision_function(X)
Call decision_function on the estimator with the best found parameters.
fit(X[, y])
Run fit with all sets of parameters.
get_metadata_routing()
Get metadata routing of this object.
get_params([deep])
Get parameters for this estimator.
inverse_transform(Xt)
Call inverse_transform on the estimator with the best found params.
predict(X)
Call predict on the estimator with the best found parameters.
predict_log_proba(X)
Call predict_log_proba on the estimator with the best found parameters.
predict_proba(X)
Call predict_proba on the estimator with the best found parameters.
score(X[, y])
Return the score on the given data, if the estimator has been refit.
score_samples(X)
Call score_samples on the estimator with the best found parameters.
set_params(**params)
Set the parameters of this estimator.
transform(X)
Call transform on the estimator with the best found parameters.
property classes_¶
Class labels.
Only available when refit=True and the estimator is a classifier.
decision_function(X)[source]¶
Call decision_function on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports
decision_function.
Parameters:
Xindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
y_scorendarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)Result of the decision function for X based on the estimator with
the best found parameters.
fit(X, y=None, **params)[source]¶
Run fit with all sets of parameters.
Parameters:
Xarray-like of shape (n_samples, n_features)Training vector, where n_samples is the number of samples and
n_features is the number of features.
yarray-like of shape (n_samples, n_output) or (n_samples,), default=NoneTarget relative to X for classification or regression;
None for unsupervised learning.
**paramsdict of str -> objectParameters passed to the fit method of the estimator, the scorer,
and the CV splitter.
If a fit parameter is an array-like whose length is equal to
num_samples then it will be split across CV groups along with X
and y. For example, the sample_weight parameter is split
because len(sample_weights) = len(X).
Returns:
selfobjectInstance of fitted estimator.
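A sketch of passing a per-sample fit parameter through the search. It assumes metadata routing is left at its default (disabled), in which case the extra keyword is forwarded to the estimator's fit and sliced along with X and y for every split. The weights are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Arbitrary weights (an assumption): count samples of class 2 twice.
sample_weight = np.where(y == 2, 2.0, 1.0)

search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=3)
search.fit(X, y, sample_weight=sample_weight)

print(search.best_params_)
```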
get_metadata_routing()[source]¶
Get metadata routing of this object.
Please check User Guide on how the routing
mechanism works.
New in version 1.4.
Returns:
routingMetadataRouterA MetadataRouter encapsulating
routing information.
get_params(deep=True)[source]¶
Get parameters for this estimator.
Parameters:
deepbool, default=TrueIf True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns:
paramsdictParameter names mapped to their values.
inverse_transform(Xt)[source]¶
Call inverse_transform on the estimator with the best found params.
Only available if the underlying estimator implements
inverse_transform and refit=True.
Parameters:
Xtindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
X{ndarray, sparse matrix} of shape (n_samples, n_features)Result of the inverse_transform function for Xt based on the
estimator with the best found parameters.
property n_features_in_¶
Number of features seen during fit.
Only available when refit=True.
predict(X)[source]¶
Call predict on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports
predict.
Parameters:
Xindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
y_predndarray of shape (n_samples,)The predicted labels or values for X based on the estimator with
the best found parameters.
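With refit=True (the default), predict on the search object delegates to the refitted best_estimator_. A minimal sketch, with the dataset and grid assumed for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=3).fit(X_train, y_train)

# Both calls use the estimator refitted on the whole training set.
pred_search = search.predict(X_test)
pred_best = search.best_estimator_.predict(X_test)
print(np.array_equal(pred_search, pred_best))
```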
predict_log_proba(X)[source]¶
Call predict_log_proba on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports
predict_log_proba.
Parameters:
Xindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
y_predndarray of shape (n_samples,) or (n_samples, n_classes)Predicted class log-probabilities for X based on the estimator
with the best found parameters. The order of the classes
corresponds to that in the fitted attribute classes_.
predict_proba(X)[source]¶
Call predict_proba on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports
predict_proba.
Parameters:
Xindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
y_predndarray of shape (n_samples,) or (n_samples, n_classes)Predicted class probabilities for X based on the estimator with
the best found parameters. The order of the classes corresponds
to that in the fitted attribute classes_.
score(X, y=None, **params)[source]¶
Return the score on the given data, if the estimator has been refit.
This uses the score defined by scoring where provided, and the
best_estimator_.score method otherwise.
Parameters:
Xarray-like of shape (n_samples, n_features)Input data, where n_samples is the number of samples and
n_features is the number of features.
yarray-like of shape (n_samples, n_output) or (n_samples,), default=NoneTarget relative to X for classification or regression;
None for unsupervised learning.
**paramsdictParameters to be passed to the underlying scorer(s).
New in version 1.4: Only available if enable_metadata_routing=True. See
Metadata Routing User Guide for more
details.
Returns:
scorefloatThe score defined by scoring if provided, and the
best_estimator_.score method otherwise.
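To illustrate that score follows the scoring parameter, the sketch below searches with scoring="f1_macro" and compares the result with a direct call to f1_score. The dataset, split and grid are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(SVC(), {"C": [1, 10]}, scoring="f1_macro", cv=3)
search.fit(X_train, y_train)

# score uses the scorer given in scoring, not SVC's default accuracy.
search_score = search.score(X_test, y_test)
manual_f1 = f1_score(y_test, search.predict(X_test), average="macro")
print(search_score, manual_f1)
```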
score_samples(X)[source]¶
Call score_samples on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports
score_samples.
New in version 0.24.
Parameters:
XiterableData to predict on. Must fulfill input requirements
of the underlying estimator.
Returns:
y_scorendarray of shape (n_samples,)The best_estimator_.score_samples method.
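score_samples is available when the underlying estimator provides it. KernelDensity is one such estimator. The sketch below tunes its bandwidth on synthetic data, with both the data and the candidate bandwidths assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))

# Unsupervised search: y is omitted and KernelDensity.score is used.
search = GridSearchCV(KernelDensity(), {"bandwidth": [0.1, 0.5, 1.0]}, cv=3)
search.fit(X)

# Per-sample log-density from the refitted best estimator.
log_density = search.score_samples(X[:5])
print(search.best_params_, log_density)
```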
set_params(**params)[source]¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Parameters:
**paramsdictEstimator parameters.
Returns:
selfestimator instanceEstimator instance.
transform(X)[source]¶
Call transform on the estimator with the best found parameters.
Only available if the underlying estimator supports transform and
refit=True.
Parameters:
Xindexable, length n_samplesMust fulfill the input assumptions of the
underlying estimator.
Returns:
Xt{ndarray, sparse matrix} of shape (n_samples, n_features)X transformed in the new space based on the estimator with
the best found parameters.
Examples using sklearn.model_selection.GridSearchCV¶
Release Highlights for scikit-learn 1.4
Release Highlights for scikit-learn 0.24
Feature agglomeration vs. univariate selection
Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood
Model selection with Probabilistic PCA and Factor Analysis (FA)
Comparing Random Forests and Histogram Gradient Boosting models
Gaussian Mixture Model Selection
Comparison of kernel ridge regression and SVR
Displaying Pipelines
Balance model complexity and cross-validated score
Comparing randomized search and grid search for hyperparameter estimation
Comparison between grid search and successive halving
Custom refit strategy of a grid search with cross-validation
Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV
Nested versus non-nested cross-validation
Sample pipeline for text feature extraction and evaluation
Statistical comparison of models using grid search
Overview of multiclass training meta-estimators
Caching nearest neighbors
Kernel Density Estimation
Column Transformer with Mixed Types
Concatenating multiple feature extraction methods
Pipelining: chaining a PCA and a logistic regression
Selecting dimensionality reduction with Pipeline and GridSearchCV
Feature discretization
Plot classification boundaries with different SVM Kernels
RBF SVM parameters
Cross-validation on diabetes Dataset Exercise
2.3. Clustering — scikit-learn 1.4.1 documentation
2.3. Clustering
Clustering of
unlabeled data can be performed with the module sklearn.cluster.
Each clustering algorithm comes in two variants: a class, that implements
the fit method to learn the clusters on train data, and a function,
that, given train data, returns an array of integer labels corresponding
to the different clusters. For the class, the labels over the training
data can be found in the labels_ attribute.
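The two variants can be compared side by side. The following is a minimal sketch on a toy array that is assumed here for illustration; it uses the KMeans class and the k_means function.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Toy data (assumed for illustration): two well-separated groups of points.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])

# Class variant: fit learns the clusters; labels_ holds the training labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: returns the centers, the integer labels and the inertia.
centers, func_labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)
```

The class keeps the fitted state (centers, labels) on the estimator, while the function returns the results directly.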
Input data
One important thing to note is that the algorithms implemented in
this module can take different kinds of matrix as input. All the
methods accept standard data matrices of shape (n_samples, n_features).
These can be obtained from the classes in the sklearn.feature_extraction
module. For AffinityPropagation, SpectralClustering
and DBSCAN one can also input similarity matrices of shape
(n_samples, n_samples). These can be obtained from the functions
in the sklearn.metrics.pairwise module.
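As a sketch of the two input kinds, the snippet below fits AffinityPropagation once on a data matrix and once on a precomputed similarity matrix. The dataset and the choice of negative squared Euclidean distance as the similarity are assumptions made for illustration.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

X, _ = make_blobs(n_samples=60, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)

# Input as a data matrix of shape (n_samples, n_features).
ap_data = AffinityPropagation(random_state=0).fit(X)

# Input as a similarity matrix of shape (n_samples, n_samples):
# here the negative squared Euclidean distance between samples.
S = -euclidean_distances(X, squared=True)
ap_sim = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
```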
2.3.1. Overview of clustering methods
A comparison of the clustering algorithms in scikit-learn
| Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, transductive | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances, transductive | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points |
| HDBSCAN | minimum cluster membership, minimum point neighbors | Large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation, inductive | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Very large n_samples, medium n_clusters | General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical | Distances between points |
Non-flat geometry clustering is useful when the clusters have a specific
shape, i.e. a non-flat manifold, and the standard euclidean distance is
not the right metric. This case arises in the two top rows of the figure
above.
Gaussian mixture models, useful for clustering, are described in
another chapter of the documentation dedicated to
mixture models. KMeans can be seen as a special case of Gaussian mixture
model with equal covariance per component.
Transductive clustering methods (in contrast to
inductive clustering methods) are not designed to be applied to new,
unseen data.
2.3.2. K-means
The KMeans algorithm clusters data by trying to separate samples in n
groups of equal variance, minimizing a criterion known as the inertia or
within-cluster sum-of-squares (see below). This algorithm requires the number
of clusters to be specified. It scales well to large numbers of samples and has
been used across a large range of application areas in many different fields.
The k-means algorithm divides a set of \(N\) samples \(X\) into
\(K\) disjoint clusters \(C\), each described by the mean \(\mu_j\)
of the samples in the cluster. The means are commonly called the cluster
“centroids”; note that they are not, in general, points from \(X\),
although they live in the same space.
The K-means algorithm aims to choose centroids that minimise the inertia,
or within-cluster sum-of-squares criterion:
\[\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)\]
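The criterion above can be evaluated directly with NumPy and compared to the inertia_ attribute of a fitted KMeans. This is a minimal sketch on synthetic data assumed for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to every centroid: (n_samples, n_clusters).
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
# Sum over samples of the squared distance to the closest centroid.
inertia = sq_dist.min(axis=1).sum()
```

Up to floating point error, inertia should agree with km.inertia_.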
Inertia can be recognized as a measure of how internally coherent clusters are.
It suffers from various drawbacks:
Inertia makes the assumption that clusters are convex and isotropic,
which is not always the case. It responds poorly to elongated clusters,
or manifolds with irregular shapes.
Inertia is not a normalized metric: we just know that lower values are
better and zero is optimal. But in very high-dimensional spaces, Euclidean
distances tend to become inflated
(this is an instance of the so-called “curse of dimensionality”).
Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to
k-means clustering can alleviate this problem and speed up the
computations.
For more detailed descriptions of the issues shown above and how to address them,
refer to the examples Demonstration of k-means assumptions
and Selecting the number of clusters with silhouette analysis on KMeans clustering.
K-means is often referred to as Lloyd’s algorithm. In basic terms, the
algorithm has three steps. The first step chooses the initial centroids, with
the most basic method being to choose \(k\) samples from the dataset
\(X\). After initialization, K-means consists of looping between the
two other steps. The first step assigns each sample to its nearest centroid.
The second step creates new centroids by taking the mean value of all of the
samples assigned to each previous centroid. The difference between the old
and the new centroids is computed, and the algorithm repeats these last two
steps until this value is less than a threshold. In other words, it repeats
until the centroids do not move significantly.
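The three steps can be sketched in plain NumPy. This is an illustrative implementation, not the one used by KMeans; the function name lloyd_kmeans, its parameters and the toy data are assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    """Illustrative Lloyd iteration (hypothetical helper, not the sklearn code)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct samples from X as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 3: new centroid = mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = lloyd_kmeans(X_demo, k=2)
```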
K-means is equivalent to the expectation-maximization algorithm
with a small, all-equal, diagonal covariance matrix.
The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of
the points is calculated using the current centroids. Each segment in the
Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated
to the mean of each segment. The algorithm then repeats this until a stopping
criterion is fulfilled. Usually, the algorithm stops when the relative decrease
in the objective function between iterations is less than the given tolerance
value. This is not the case in this implementation: iteration stops when
centroids move less than the tolerance.
Given enough time, K-means will always converge, although possibly to a local
minimum. This is highly dependent on the initialization of the centroids.
As a result, the computation is often done several times, with different
initializations of the centroids. One method to help address this issue is the
k-means++ initialization scheme, which has been implemented in scikit-learn
(use the init='k-means++' parameter). This initializes the centroids to be
(generally) distant from each other, leading to probably better results than
random initialization, as shown in the reference. For a detailed example of
comparing different initialization schemes, refer to
A demo of K-Means clustering on the handwritten digits data.
K-means++ can also be called independently to select seeds for other
clustering algorithms, see sklearn.cluster.kmeans_plusplus for details
and example usage.
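Both uses of the seeding scheme can be sketched as follows; the dataset is assumed for illustration.

```python
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Standalone seeding: returns the chosen seeds and their row indices in X.
seeds, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)

# The same scheme inside KMeans, restarted several times through n_init.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
```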
The algorithm supports sample weights, which can be given by the parameter
sample_weight. This makes it possible to assign more weight to some samples when
computing cluster centers and values of inertia. For example, assigning a
weight of 2 to a sample is equivalent to adding a duplicate of that sample
to the dataset \(X\).
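The equivalence can be checked on a small example. The data, the fixed initial centroids and tol=0.0 are assumptions chosen so that both runs are deterministic and reach the same fixed point.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init = np.array([[0.0, 0.0], [10.0, 10.0]])   # fixed start for both runs

# Run 1: give the first sample a weight of 2.
weights = np.array([2.0, 1.0, 1.0, 1.0, 1.0, 1.0])
weighted = KMeans(n_clusters=2, init=init, n_init=1, tol=0.0)
weighted.fit(X, sample_weight=weights)

# Run 2: add a duplicate of the first sample instead.
X_dup = np.vstack([X, X[:1]])
duplicated = KMeans(n_clusters=2, init=init, n_init=1, tol=0.0).fit(X_dup)
```

Both runs should end with the same centers and the same inertia.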
K-means can be used for vector quantization. This is achieved using the
transform method of a trained model of KMeans. For an example of
performing vector quantization on an image refer to
Color Quantization using K-Means.
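A minimal sketch of the idea, on synthetic data assumed for illustration: transform returns the distance from each sample to every centroid, and the nearest centroid serves as the code-book entry for that sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# transform maps each sample to its distances to the 5 centroids.
distances = km.transform(X)            # shape (n_samples, n_clusters)
codes = distances.argmin(axis=1)       # index of the nearest code-book entry
# Quantized data: every sample replaced by its nearest centroid.
X_quantized = km.cluster_centers_[codes]
```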
Examples:
K-means Clustering: Example usage of
KMeans using the iris dataset
Clustering text documents using k-means: Document clustering
using KMeans and MiniBatchKMeans based on sparse data
2.3.2.1. Low-level parallelism
KMeans benefits from OpenMP based parallelism through Cython. Small
chunks of data (256 samples) are processed in parallel, which in addition
yields a low memory footprint. For more details on how to control the number of
threads, please refer to our Parallelism notes.
Examples:
Demonstration of k-means assumptions: Demonstrating when
k-means performs intuitively and when it does not
A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits
References:
“k-means++: The advantages of careful seeding”
Arthur, David, and Sergei Vassilvitskii,
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete
algorithms, Society for Industrial and Applied Mathematics (2007)
2.3.2.2. Mini Batch K-Means
The MiniBatchKMeans is a variant of the KMeans algorithm
which uses mini-batches to reduce the computation time, while still attempting
to optimise the same objective function. Mini-batches are subsets of the input
data, randomly sampled in each training iteration. These mini-batches
drastically reduce the amount of computation required to converge to a local
solution. In contrast to other algorithms that reduce the convergence time of
k-means, mini-batch k-means produces results that are generally only slightly
worse than the standard algorithm.
The algorithm iterates between two major steps, similar to vanilla k-means.
In the first step, \(b\) samples are drawn randomly from the dataset, to form
a mini-batch. These are then assigned to the nearest centroid. In the second
step, the centroids are updated. In contrast to k-means, this is done on a
per-sample basis. For each sample in the mini-batch, the assigned centroid
is updated by taking the streaming average of the sample and all previous
samples assigned to that centroid. This has the effect of decreasing the
rate of change for a centroid over time. These steps are performed until
convergence or a predetermined number of iterations is reached.
MiniBatchKMeans converges faster than KMeans, but the quality
of the results is reduced. In practice this difference in quality can be quite
small, as shown in the example and cited reference.
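A small comparison can be sketched as follows. The dataset and batch_size=256 are assumptions for illustration; with the default compute_labels=True, both inertia_ values are evaluated on the full dataset.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.7, random_state=0)

km = KMeans(n_clusters=3, n_init=3, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3,
                      random_state=0).fit(X)

# Inertia of each fitted model, for a side-by-side comparison.
print(f"KMeans inertia:          {km.inertia_:.1f}")
print(f"MiniBatchKMeans inertia: {mbk.inertia_:.1f}")
```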
Examples:
Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of
KMeans and MiniBatchKMeans
Clustering text documents using k-means: Document clustering
using KMeans and MiniBatchKMeans based on sparse data
Online learning of a dictionary of parts of faces
References:
“Web Scale K-Means clustering”
D. Sculley, Proceedings of the 19th international conference on World
wide web (2010)
2.3.3. Affinity Propagation
AffinityPropagation creates clusters by sending messages between
pairs of samples until convergence. A dataset is then described using a small
number of exemplars, which are identified as those most representative of other
samples. The messages sent between pairs represent the suitability for one
sample to be the exemplar of the other, which is updated in response to the
values from other pairs. This updating happens iteratively until convergence,
at which point the final exemplars are chosen, and hence the final clustering
is given.
Affinity Propagation can be interesting as it chooses the number of
clusters based on the data provided. For this purpose, the two important
parameters are the preference, which controls how many exemplars are
used, and the damping factor which damps the responsibility and
availability messages to avoid numerical oscillations when updating these
messages.
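The two parameters can be set as in the following sketch. The dataset and the values preference=-50 and damping=0.5 are illustrative assumptions, not recommendations.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)

# A lower preference tends to give fewer exemplars; damping must lie in [0.5, 1).
ap = AffinityPropagation(preference=-50, damping=0.5, random_state=0).fit(X)

exemplars = ap.cluster_centers_indices_   # row indices of the chosen exemplars
labels = ap.labels_                       # exemplar assignment of each sample
```

The number of clusters is not passed in; it is the number of exemplars found.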
The main drawback of Affinity Propagation is its complexity. The
algorithm has a time complexity of the order \(O(N^2 T)\), where \(N\)
is the number of samples and \(T\) is the number of iterations until
convergence. Further, the memory complexity is of the order
\(O(N^2)\) if a dense similarity matrix is used, but reducible if a
sparse similarity matrix is used. This makes Affinity Propagation most
appropriate for small to medium sized datasets.
Examples:
Demo of affinity propagation clustering algorithm: Affinity
Propagation on a synthetic 2D dataset with 3 classes.
Visualizing the stock market structure: Affinity Propagation on
financial time series to find groups of companies.
Algorithm description:
The messages sent between points belong to one of two categories. The first is
the responsibility \(r(i, k)\),
which is the accumulated evidence that sample \(k\)
should be the exemplar for sample \(i\).
The second is the availability \(a(i, k)\)
which is the accumulated evidence that sample \(i\)
should choose sample \(k\) to be its exemplar,
and considers the values for all other samples that \(k\) should
be an exemplar. In this way, exemplars are chosen by samples if they are (1)
similar enough to many samples and (2) chosen by many samples to be
representative of themselves.
More formally, the responsibility of a sample \(k\)
to be the exemplar of sample \(i\) is given by:
\[r(i, k) \leftarrow s(i, k) - max [ a(i, k') + s(i, k') \forall k' \neq k ]\]
Where \(s(i, k)\) is the similarity between samples \(i\) and \(k\).
The availability of sample \(k\)
to be the exemplar of sample \(i\) is given by:
\[a(i, k) \leftarrow min [0, r(k, k) + \sum_{i'~s.t.~i' \notin \{i, k\}}{\max(0, r(i', k))}]\]
To begin with, all values for \(r\) and \(a\) are set to zero,
and the calculation of each iterates until convergence.
As discussed above, in order to avoid numerical oscillations when updating the
messages, the damping factor \(\lambda\) is introduced into the iteration process:
\[r_{t+1}(i, k) = \lambda\cdot r_{t}(i, k) + (1-\lambda)\cdot r_{t+1}(i, k)\]
\[a_{t+1}(i, k) = \lambda\cdot a_{t}(i, k) + (1-\lambda)\cdot a_{t+1}(i, k)\]
where \(t\) indicates the iteration number.
2.3.4. Mean Shift
MeanShift clustering aims to discover blobs in a smooth density of
samples. It is a centroid based algorithm, which works by updating candidates
for centroids to be the mean of the points within a given region. These
candidates are then filtered in a post-processing stage to eliminate
near-duplicates to form the final set of centroids.
The position of centroid candidates is iteratively adjusted using a technique called hill
climbing, which finds local maxima of the estimated probability density.
Given a candidate centroid \(x\) for iteration \(t\), the candidate
is updated according to the following equation:
\[x^{t+1} = x^t + m(x^t)\]
Where \(m\) is the mean shift vector that is computed for each
centroid that points towards a region of the maximum increase in the density of points.
To compute \(m\) we define \(N(x)\) as the neighborhood of samples within
a given distance around \(x\). Then \(m\) is computed using the following
equation, effectively updating a centroid to be the mean of the samples within
its neighborhood:
\[m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x\]
In general, the equation for \(m\) depends on a kernel used for density estimation.
The generic formula is:
\[m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x\]
In our implementation, \(K(x)\) is equal to 1 if \(x\) is small enough and is
equal to 0 otherwise. Effectively \(K(y - x)\) indicates whether \(y\) is in
the neighborhood of \(x\).
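With the flat kernel, the update reduces to moving a candidate to the mean of its neighborhood. The following NumPy sketch illustrates this; the function names, parameters and toy data are assumptions, and the code is not the MeanShift implementation.

```python
import numpy as np

def mean_shift_step(x, X, bandwidth):
    # N(x): samples within `bandwidth` of x (flat kernel: K = 1 inside, 0 outside).
    in_neighborhood = np.linalg.norm(X - x, axis=1) <= bandwidth
    # m(x) = mean of the neighborhood minus x.
    m = X[in_neighborhood].mean(axis=0) - x
    return x + m

def hill_climb(x, X, bandwidth, tol=1e-6, max_iter=300):
    # Repeat x <- x + m(x) until the candidate stops moving.
    for _ in range(max_iter):
        x_new = mean_shift_step(x, X, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mode = hill_climb(X_demo[0], X_demo, bandwidth=5.0)
```

Here the bandwidth covers all four points, so the candidate moves to their mean.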
The algorithm automatically sets the number of clusters; instead, it relies on a
parameter bandwidth, which dictates the size of the region to search through.
This parameter can be set manually, or it can be estimated using the provided
estimate_bandwidth function, which is called if the bandwidth is not set.
The algorithm is not highly scalable, as it requires multiple nearest neighbor
searches during the execution of the algorithm. The algorithm is guaranteed to
converge; in practice, it stops iterating when the change in centroids
is small.
Labelling a new sample is performed by finding the nearest centroid for a
given sample.
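A short usage sketch follows, with the bandwidth estimated explicitly and a few samples labelled after fitting. The dataset and quantile=0.2 are assumptions for illustration.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# quantile=0.2 is an illustrative choice, not a recommended default.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)
ms = MeanShift(bandwidth=bandwidth).fit(X)

n_clusters = len(ms.cluster_centers_)   # found by the algorithm, not supplied
new_labels = ms.predict(X[:5])          # nearest centroid for each given sample
```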
Examples:
A demo of the mean-shift clustering algorithm: Mean Shift clustering
on a synthetic 2D dataset with 3 classes.
References:
“Mean shift: A robust approach toward feature space analysis”
D. Comaniciu and P. Meer, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
2.3.5. Spectral clustering
SpectralClustering performs a low-dimension embedding of the
affinity matrix between samples, followed by clustering, e.g., by KMeans,
of the components of the eigenvectors in the low dimensional space.
It is especially computationally efficient if the affinity matrix is sparse
and the amg solver is used for the eigenvalue problem. (Note that the amg solver
requires that the pyamg module is installed.)
The present version of SpectralClustering requires the number of clusters
to be specified in advance. It works well for a small number of clusters,
but is not advised for many clusters.
For two clusters, SpectralClustering solves a convex relaxation of the
normalized cuts
problem on the similarity graph: cutting the graph in two so that the weight of
the edges cut is small compared to the weights of the edges inside each
cluster. This criterion is especially interesting when working on images, where
graph vertices are pixels, and weights of the edges of the similarity graph are
computed using a function of a gradient of the image.
Warning
Transforming distance to well-behaved similarities
Note that if the values of your similarity matrix are not well
distributed, e.g. with negative values or with a distance matrix
rather than a similarity, the spectral problem will be singular and
the problem not solvable. In that case it is advised to apply a
transformation to the entries of the matrix. For instance, in the
case of a signed distance matrix, it is common to apply a heat kernel:
similarity = np.exp(-beta * distance / distance.std())
See the examples for such an application.
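The transformation can be sketched end to end as follows. The dataset, the Euclidean distance and beta = 1.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

distance = pairwise_distances(X)        # (n_samples, n_samples), Euclidean
beta = 1.0                              # assumed kernel width; tune per dataset
similarity = np.exp(-beta * distance / distance.std())

# The similarity matrix is passed in place of the data matrix.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(similarity)
```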
Examples:
Spectral clustering for image segmentation: Segmenting objects
from a noisy background using spectral clustering.
Segmenting the picture of greek coins in regions: Spectral clustering
to split the image of coins in regions.
2.3.5.1. Different label assignment strategies
Different label assignment strategies can be used, corresponding to the
assign_labels parameter of SpectralClustering.
"kmeans" strategy can match finer details, but can be unstable.
In particular, unless you control the random_state, it may not be
reproducible from run-to-run, as it depends on random initialization.
The alternative "discretize" strategy is 100% reproducible, but tends
to create parcels of fairly even and geometrical shape.
The recently added "cluster_qr" option is a deterministic alternative that
tends to create the visually best partitioning on the example application
below.
assign_labels="kmeans"
assign_labels="discretize"
assign_labels="cluster_qr"
References:
“Multiclass spectral clustering”
Stella X. Yu, Jianbo Shi, 2003
“Simple, direct, and efficient multi-way spectral clustering”
Anil Damle, Victor Minden, Lexing Ying, 2019
2.3.5.2. Spectral Clustering Graphs
Spectral Clustering can also be used to partition graphs via their spectral
embeddings. In this case, the affinity matrix is the adjacency matrix of the
graph, and SpectralClustering is initialized with affinity='precomputed':
>>> from sklearn.cluster import SpectralClustering
>>> sc = SpectralClustering(3, affinity='precomputed', n_init=100,
... assign_labels='discretize')
>>> sc.fit_predict(adjacency_matrix)
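The snippet above leaves adjacency_matrix undefined. A self-contained sketch, assuming a hypothetical graph made of two 5-node cliques joined by a single edge, could look like this:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical graph: two 5-node cliques joined by the single edge (4, 5).
n = 10
adjacency_matrix = np.zeros((n, n))
adjacency_matrix[:5, :5] = 1.0
adjacency_matrix[5:, 5:] = 1.0
np.fill_diagonal(adjacency_matrix, 0.0)
adjacency_matrix[4, 5] = adjacency_matrix[5, 4] = 1.0

sc = SpectralClustering(2, affinity="precomputed", assign_labels="discretize",
                        random_state=0)
labels = sc.fit_predict(adjacency_matrix)
```

On a graph of this kind, one would expect the partition to follow the two cliques.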
References:
“A Tutorial on Spectral Clustering”
Ulrike von Luxburg, 2007
“Normalized cuts and image segmentation”
Jianbo Shi, Jitendra Malik, 2000
“A Random Walks View of Spectral Segmentation”
Marina Meila, Jianbo Shi, 2001
“On Spectral Clustering: Analysis and an algorithm”
Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001
“Preconditioned Spectral Clustering for Stochastic
Block Partition Streaming Graph Challenge”
David Zhuzhunashvili, Andrew Knyazev
2.3.6. Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that
build nested clusters by merging or splitting them successively. This
hierarchy of clusters is represented as a tree (or dendrogram). The root of the
tree is the unique cluster that gathers all the samples, the leaves being the
clusters with only one sample. See the Wikipedia page for more details.
The AgglomerativeClustering object performs a hierarchical clustering
using a bottom up approach: each observation starts in its own cluster, and
clusters are successively merged together. The linkage criterion determines the
metric used for the merge strategy:
Ward minimizes the sum of squared differences within all clusters. It is a
variance-minimizing approach and in this sense is similar to the k-means
objective function but tackled with an agglomerative hierarchical
approach.
Maximum or complete linkage minimizes the maximum distance between
observations of pairs of clusters.
Average linkage minimizes the average of the distances between all
observations of pairs of clusters.
Single linkage minimizes the distance between the closest
observations of pairs of clusters.
AgglomerativeClustering can also scale to a large number of samples
when it is used jointly with a connectivity matrix, but is computationally
expensive when no connectivity constraints are added between samples: it
considers at each step all the possible merges.
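The four linkage criteria listed above are selected through the linkage parameter. A minimal sketch on synthetic data assumed for illustration:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labels_by_linkage = {}
for linkage in ("ward", "complete", "average", "single"):
    # Bottom-up merging stops once 3 clusters remain.
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    labels_by_linkage[linkage] = model.labels_
```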
FeatureAgglomeration
The FeatureAgglomeration uses agglomerative clustering to
group together features that look very similar, thus decreasing the
number of features. It is a dimensionality reduction tool, see
Unsupervised dimensionality reduction.
2.3.6.1. Different linkage type: Ward, complete, average, and single linkage
AgglomerativeClustering supports Ward, single, average, and complete
linkage strategies.
Agglomerative clustering has a “rich get richer” behavior that leads to
uneven cluster sizes. In this regard, single linkage is the worst
strategy, and Ward gives the most regular sizes. However, the affinity
(or distance used in clustering) cannot be varied with Ward, thus for non
Euclidean metrics, average linkage is a good alternative. Single linkage,
while not robust to noisy data, can be computed very efficiently and can
therefore be useful to provide hierarchical clustering of larger datasets.
Single linkage can also perform well on non-globular data.
Examples:
Various Agglomerative Clustering on a 2D embedding of digits: exploration of the
different linkage strategies in a real dataset.
2.3.6.2. Visualization of cluster hierarchy
It’s possible to visualize the tree representing the hierarchical merging of clusters
as a dendrogram. Visual inspection can often be useful for understanding the structure
of the data, though more so in the case of small sample sizes.
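One way to obtain a dendrogram is to convert the fitted merge tree into a SciPy linkage matrix. This is a sketch, assuming SciPy is available and the full tree is computed by setting distance_threshold=0 with n_clusters=None; the dataset is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

# Count the samples under each merge node of the tree.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child in merge:
        # Indices below n_samples are leaves; others refer to earlier merges.
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

# SciPy linkage format: [child_a, child_b, merge distance, sample count].
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# no_plot=True returns the layout only; drop it to draw with matplotlib.
tree = dendrogram(linkage_matrix, no_plot=True)
```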
2.3.6.3. Adding connectivity constraints
An interesting aspect of AgglomerativeClustering is that
connectivity constraints can be added to this algorithm (only adjacent
clusters can be merged together), through a connectivity matrix that defines
for each sample the neighboring samples following a given structure of the
data. For instance, in the swiss-roll example below, the connectivity
constraints forbid the merging of points that are not adjacent on the swiss
roll, and thus avoid forming clusters that extend across overlapping folds of
the roll.
These constraints are useful to impose a certain local structure, but they
also make the algorithm faster, especially when the number of samples
is high.
The connectivity constraints are imposed via a connectivity matrix: a
scipy sparse matrix that has elements only at the intersection of a row
and a column with indices of the dataset that should be connected. This
matrix can be constructed from a-priori information: for instance, you
may wish to cluster web pages by only merging pages with a link pointing
from one to another. It can also be learned from the data, for instance
using sklearn.neighbors.kneighbors_graph to restrict
merging to nearest neighbors as in this example, or
using sklearn.feature_extraction.image.grid_to_graph to
enable only merging of neighboring pixels on an image, as in the
coin example.
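A connectivity matrix learned from the data can be sketched as follows. The swiss-roll sample size, the 10 neighbors and the 6 clusters are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Sparse matrix: entry (i, j) is non-zero when j is one of i's 10 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# Structured: only samples connected in the graph can be merged.
structured = AgglomerativeClustering(
    n_clusters=6, connectivity=connectivity, linkage="ward"
).fit(X)
# Unstructured: every pair of clusters is a merge candidate.
unstructured = AgglomerativeClustering(n_clusters=6, linkage="ward").fit(X)
```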
Examples:
A demo of structured Ward hierarchical clustering on an image of coins: Ward clustering
to split the image of coins in regions.
Hierarchical clustering: structured vs unstructured ward: Example of
Ward algorithm on a swiss-roll, comparison of structured approaches
versus unstructured approaches.
Feature agglomeration vs. univariate selection:
Example of dimensionality reduction with feature agglomeration based on
Ward hierarchical clustering.
Agglomerative clustering with and without structure
Warning
Connectivity constraints with single, average and complete linkage
Connectivity constraints and single, complete or average linkage can enhance
the ‘rich getting richer’ aspect of agglomerative clustering,
particularly so if they are built with
sklearn.neighbors.kneighbors_graph. In the limit of a small
number of clusters, they tend to give a few macroscopically occupied
clusters and almost empty ones. (see the discussion in
Agglomerative clustering with and without structure).
Single linkage is the most brittle linkage option with regard to this issue.
2.3.6.4. Varying the metric
Single, average and complete linkage can be used with a variety of distances (or
affinities), in particular Euclidean distance (l2), Manhattan distance
(or Cityblock, or l1), cosine distance, or any precomputed affinity
matrix.
l1 distance is often good for sparse features, or sparse noise: i.e.
many of the features are zero, as in text mining using occurrences of
rare words.
cosine distance is interesting because it is invariant to global
scalings of the signal.
The guideline for choosing a metric is to use one that maximizes the
distance between samples in different classes, and minimizes that within
each class.
Examples:
Agglomerative clustering with different metrics
2.3.6.5. Bisecting K-Means
The BisectingKMeans is an iterative variant of KMeans, using
divisive hierarchical clustering. Instead of creating all centroids at once, centroids
are picked progressively based on a previous clustering: a cluster is split into two
new clusters repeatedly until the target number of clusters is reached.
BisectingKMeans is more efficient than KMeans when the number of
clusters is large since it only works on a subset of the data at each bisection
while KMeans always works on the entire dataset.
Although BisectingKMeans can’t benefit from the advantages of the "k-means++"
initialization by design, it will still produce results comparable to
KMeans(init="k-means++") in terms of inertia at cheaper computational costs, and will
likely produce better results than KMeans with a random initialization.
This variant is more efficient than agglomerative clustering if the number of clusters is
small compared to the number of data points.
This variant also does not produce empty clusters.
There exist two strategies for selecting the cluster to split:
bisecting_strategy="largest_cluster" selects the cluster having the most points
bisecting_strategy="biggest_inertia" selects the cluster with biggest inertia
(cluster with biggest Sum of Squared Errors within)
Picking the cluster with the largest number of data points in most cases produces results as
accurate as picking by inertia and is faster (especially for larger numbers of data
points, where calculating the error may be costly).
Picking by the largest number of data points will also likely produce clusters of similar
sizes, while KMeans is known to produce clusters of different sizes.
The difference between Bisecting K-Means and regular K-Means can be seen in the example
Bisecting K-Means and Regular K-Means Performance Comparison.
While the regular K-Means algorithm tends to create non-related clusters,
clusters from Bisecting K-Means are well ordered and create quite a visible hierarchy.
References:
“A Comparison of Document Clustering Techniques”
Michael Steinbach, George Karypis and Vipin Kumar,
Department of Computer Science and Engineering, University of Minnesota
(June 2000)
“Performance Analysis of K-Means and Bisecting K-Means Algorithms in Weblog Data”
K.Abirami and Dr.P.Mayilvahanan,
International Journal of Emerging Technologies in Engineering Research (IJETER)
Volume 4, Issue 8, (August 2016)
“Bisecting K-means Algorithm Based on K-valued Self-determining
and Clustering Center Optimization”
Jian Di, Xinyue Gou
School of Control and Computer Engineering,North China Electric Power University,
Baoding, Hebei, China (August 2017)
2.3.7. DBSCAN
The DBSCAN algorithm views clusters as areas of high density
separated by areas of low density. Due to this rather generic view, clusters
found by DBSCAN can be any shape, as opposed to k-means which assumes that
clusters are convex shaped. The central component of DBSCAN is the concept
of core samples, which are samples that are in areas of high density. A
cluster is therefore a set of core samples, each close to each other
(measured by some distance measure)
and a set of non-core samples that are close to a core sample (but are not
themselves core samples). There are two parameters to the algorithm,
min_samples and eps,
which define formally what we mean when we say dense.
Higher min_samples or lower eps
indicate higher density necessary to form a cluster.
More formally, we define a core sample as being a sample in the dataset such
that there exist min_samples other samples within a distance of
eps, which are defined as neighbors of the core sample. This tells
us that the core sample is in a dense area of the vector space. A cluster
is a set of core samples that can be built by recursively taking a core
sample, finding all of its neighbors that are core samples, finding all of
their neighbors that are core samples, and so on. A cluster also has a
set of non-core samples, which are samples that are neighbors of a core sample
in the cluster but are not themselves core samples. Intuitively, these samples
are on the fringes of a cluster.
Any core sample is part of a cluster, by definition. Any sample that is not a
core sample, and is at least eps in distance from any core sample, is
considered an outlier by the algorithm.
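The three kinds of samples (core, non-core members of a cluster, and outliers) can be read off a fitted DBSCAN. In this sketch the dataset, the isolated extra point, and the values of eps and min_samples are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.4,
                  random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])   # one isolated point, far from both blobs

# eps and min_samples are illustrative choices for this toy data.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1                  # outliers are labeled -1
border_mask = ~core_mask & ~noise_mask         # non-core members of a cluster
```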
While the parameter min_samples primarily controls how tolerant the
algorithm is towards noise (on noisy and large data sets it may be desirable
to increase this parameter), the parameter eps is crucial to choose
appropriately for the data set and distance function and usually cannot be
left at the default value. It controls the local neighborhood of the points.
When chosen too small, most data will not be clustered at all (and labeled
as -1 for “noise”). When chosen too large, it causes close clusters to
be merged into one cluster, and eventually the entire data set to be returned
as a single cluster. Some heuristics for choosing this parameter have been
discussed in the literature, for example based on a knee in the nearest neighbor
distances plot (as discussed in the references below).
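One way to apply the nearest neighbor distance heuristic is sketched below. The synthetic data and the choice k = 5 are assumptions for illustration; locating the knee in the sorted curve (for instance by plotting it) is left to the reader.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data (illustrative only): two Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(50, 2)),
    rng.normal(3.0, 0.2, size=(50, 2)),
])

k = 5  # assumed to match the min_samples value one intends to use
nn = NearestNeighbors(n_neighbors=k).fit(X)

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0 in column 0), so the last column is the distance to
# the k-th neighbor counting the point itself.
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# A sharp increase ("knee") in k_dist suggests a candidate value for eps.
print(k_dist[:5], k_dist[-5:])
```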
In the figure below, the color indicates cluster membership, with large circles
indicating core samples found by the algorithm. Smaller circles are non-core
samples that are still part of a cluster. Moreover, the outliers are indicated
by black points below.
Examples:
Demo of DBSCAN clustering algorithm
Implementation
The DBSCAN algorithm is deterministic, always generating the same clusters
when given the same data in the same order. However, the results can differ when
data is provided in a different order. First, even though the core samples
will always be assigned to the same clusters, the labels of those clusters
will depend on the order in which those samples are encountered in the data.
Second and more importantly, the clusters to which non-core samples are assigned
can differ depending on the data order. This would happen when a non-core sample
has a distance lower than eps to two core samples in different clusters. By the
triangle inequality, those two core samples must be more distant than
eps from each other, or they would be in the same cluster. The non-core
sample is assigned to whichever cluster is generated first in a pass
through the data, and so the results will depend on the data ordering.
The current implementation uses ball trees and kd-trees
to determine the neighborhood of points,
which avoids calculating the full distance matrix
(as was done in scikit-learn versions before 0.14).
The possibility to use custom metrics is retained;
for details, see NearestNeighbors.
Memory consumption for large sample sizes
This implementation is by default not memory efficient because it constructs
a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot
be used (e.g., with sparse matrices). This matrix will consume \(n^2\) floats.
A couple of mechanisms for getting around this are:
Use OPTICS clustering in conjunction with the
cluster_optics_dbscan function. OPTICS clustering also calculates the full
pairwise matrix, but only keeps one row in memory at a time (memory
complexity n).
A sparse radius neighborhood graph (where missing entries are presumed to
be out of eps) can be precomputed in a memory-efficient way and dbscan
can be run over this with metric='precomputed'. See
sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
The dataset can be compressed, either by removing exact duplicates if
these occur in your data, or by using BIRCH. Then you only have a
relatively small number of representatives for a large number of points.
You can then provide a sample_weight when fitting DBSCAN.
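The second mechanism can be sketched as follows, on made-up toy data. The sparse graph stores only the pairs that lie within eps of each other, and DBSCAN is then run on it with metric='precomputed'. scikit-learn may emit an efficiency warning if the graph rows are not sorted by distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Toy data (illustrative only).
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [9.0, 1.0],
])
eps = 0.3

# Sparse graph holding distances only for pairs within eps of each other;
# missing entries are treated as being farther apart than eps.
nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance")

labels_sparse = DBSCAN(eps=eps, min_samples=3, metric="precomputed").fit(graph).labels_

# Reference run on the raw features, for comparison.
labels_dense = DBSCAN(eps=eps, min_samples=3).fit(X).labels_
print(labels_sparse, labels_dense)
```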
References:
“A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise”
Ester, M., H. P. Kriegel, J. Sander, and X. Xu,
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
“DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.”
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.
2.3.8. HDBSCAN
The HDBSCAN algorithm can be seen as an extension of DBSCAN
and OPTICS. Specifically, DBSCAN assumes that the clustering
criterion (i.e. density requirement) is globally homogeneous.
In other words, DBSCAN may struggle to successfully capture clusters
with different densities.
HDBSCAN alleviates this assumption and explores all possible density
scales by building an alternative representation of the clustering problem.
Note
This implementation is adapted from the original implementation of HDBSCAN,
scikit-learn-contrib/hdbscan based on [LJ2017].
2.3.8.1. Mutual Reachability Graph
HDBSCAN first defines \(d_c(x_p)\), the core distance of a sample \(x_p\), as the
distance to its min_samples th-nearest neighbor, counting itself. For example,
if min_samples=5 and \(x_*\) is the 5th-nearest neighbor of \(x_p\)
then the core distance is:
\[d_c(x_p)=d(x_p, x_*).\]
Next it defines \(d_m(x_p, x_q)\), the mutual reachability distance of two points
\(x_p, x_q\), as:
\[d_m(x_p, x_q) = \max\{d_c(x_p), d_c(x_q), d(x_p, x_q)\}\]
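The two definitions above can be written directly with NumPy. This is a dense, illustrative sketch on made-up data, not the routine used inside the HDBSCAN estimator.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy data (illustrative only): three nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
min_samples = 2

D = pairwise_distances(X)  # d(x_p, x_q)

# Core distance: distance to the min_samples-th nearest neighbor, counting
# the point itself (each sorted row starts with the self-distance 0).
core = np.sort(D, axis=1)[:, min_samples - 1]

# Mutual reachability: max{d_c(x_p), d_c(x_q), d(x_p, x_q)}.
mreach = np.maximum(D, np.maximum(core[:, None], core[None, :]))

print(core)
print(mreach)
```

With this construction the diagonal of mreach holds the core distances.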
These two notions allow us to construct the mutual reachability graph
\(G_{ms}\), defined for a fixed choice of min_samples, by associating each
sample \(x_p\) with a vertex of the graph and weighting the edge between points
\(x_p, x_q\) by the mutual reachability distance \(d_m(x_p, x_q)\)
between them. We may build subsets of this graph, denoted as
\(G_{ms,\varepsilon}\), by removing any edges with value greater than \(\varepsilon\)
from the original graph. Any points whose core distance is greater than \(\varepsilon\)
are at this stage marked as noise. The remaining points are then clustered by
finding the connected components of this trimmed graph.
Note
Taking the connected components of a trimmed graph \(G_{ms,\varepsilon}\) is
equivalent to running DBSCAN* with min_samples and \(\varepsilon\). DBSCAN* is a
slightly modified version of DBSCAN mentioned in [CM2013].
2.3.8.2. Hierarchical Clustering
HDBSCAN can be seen as an algorithm which performs DBSCAN* clustering across all
values of \(\varepsilon\). As mentioned previously, this is equivalent to finding the connected
components of the mutual reachability graphs for all values of \(\varepsilon\). To do this
efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the
fully-connected mutual reachability graph, then greedily cuts the edges with highest
weight. An outline of the HDBSCAN algorithm is as follows:
1. Extract the MST of \(G_{ms}\).
2. Extend the MST by adding a “self edge” for each vertex, with weight equal
to the core distance of the underlying sample.
3. Initialize a single cluster and label for the MST.
4. Remove the edge with the greatest weight from the MST (ties are
removed simultaneously).
5. Assign cluster labels to the connected components which contain the
end points of the now-removed edge. If the component does not have at least
one edge it is instead assigned a “null” label marking it as noise.
6. Repeat 4-5 until there are no more connected components.
HDBSCAN is therefore able to obtain all possible partitions achievable by
DBSCAN* for a fixed choice of min_samples in a hierarchical fashion.
Indeed, this allows HDBSCAN to perform clustering across multiple densities
and as such it no longer needs \(\varepsilon\) to be given as a hyperparameter. Instead
it relies solely on the choice of min_samples, which tends to be a more robust
hyperparameter.
HDBSCAN can be smoothed with an additional hyperparameter min_cluster_size
which specifies that during the hierarchical clustering, components with fewer
than min_cluster_size samples are considered noise. In practice, one
can set min_cluster_size = min_samples to couple the parameters and
simplify the hyperparameter space.
References:
[CM2013]
Campello, R.J.G.B., Moulavi, D., Sander, J. (2013). Density-Based Clustering
Based on Hierarchical Density Estimates. In: Pei, J., Tseng, V.S., Cao, L.,
Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining.
PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin,
Heidelberg.
[LJ2017]
L. McInnes and J. Healy, (2017). Accelerated Hierarchical Density Based
Clustering. In: IEEE International Conference on Data Mining Workshops (ICDMW),
2017, pp. 33-42.
2.3.9. OPTICS
The OPTICS algorithm shares many similarities with the DBSCAN
algorithm, and can be considered a generalization of DBSCAN that relaxes the
eps requirement from a single value to a value range. The key difference
between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability
graph, which assigns each sample both a reachability_ distance, and a spot
within the cluster ordering_ attribute; these two attributes are assigned
when the model is fitted, and are used to determine cluster membership. If
OPTICS is run with the default value of inf set for max_eps, then DBSCAN
style cluster extraction can be performed repeatedly in linear time for any
given eps value using the cluster_optics_dbscan method. Setting
max_eps to a lower value will result in shorter run times, and can be
thought of as the maximum neighborhood radius from each point to find other
potential reachable points.
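The repeated DBSCAN-style extraction can be sketched as follows. The synthetic data and the two eps values are assumptions chosen for illustration; OPTICS is fitted once and the fitted attributes are reused for each extraction.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Synthetic data (illustrative only): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(40, 2)),
    rng.normal(5.0, 0.3, size=(40, 2)),
])

# Fit once; max_eps is left at its default of inf.
opt = OPTICS(min_samples=5).fit(X)

# Reuse the fitted reachability, core distances and ordering to extract
# DBSCAN-like labelings for several eps values without refitting.
labels_by_eps = {}
for eps in (0.5, 1.0):
    labels_by_eps[eps] = cluster_optics_dbscan(
        reachability=opt.reachability_,
        core_distances=opt.core_distances_,
        ordering=opt.ordering_,
        eps=eps,
    )

for eps, labels in labels_by_eps.items():
    print(eps, sorted(set(labels.tolist())))
```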
The reachability distances generated by OPTICS allow for variable density
extraction of clusters within a single data set. As shown in the above plot,
combining reachability distances and data set ordering_ produces a
reachability plot, where point density is represented on the Y-axis, and
points are ordered such that nearby points are adjacent. ‘Cutting’ the
reachability plot at a single value produces DBSCAN like results; all points
above the ‘cut’ are classified as noise, and each time that there is a break
when reading from left to right signifies a new cluster. The default cluster
extraction with OPTICS looks at the steep slopes within the graph to find
clusters, and the user can define what counts as a steep slope using the
parameter xi. There are also other possibilities for analysis on the graph
itself, such as generating hierarchical representations of the data through
reachability-plot dendrograms, and the hierarchy of clusters detected by the
algorithm can be accessed through the cluster_hierarchy_ attribute. The
plot above has been color-coded so that cluster colors in planar space match
the linear segment clusters of the reachability plot. Note that the blue and
red clusters are adjacent in the reachability plot, and can be hierarchically
represented as children of a larger parent cluster.
Examples:
Demo of OPTICS clustering algorithm
Comparison with DBSCAN
The results from the OPTICS cluster_optics_dbscan method and DBSCAN are
very similar, but not always identical; in particular, the labeling of periphery
and noise points can differ. This is in part because the first samples of each dense
area processed by OPTICS have a large reachability value while being close
to other points in their area, and will thus sometimes be marked as noise
rather than periphery. This affects adjacent points when they are
considered as candidates for being marked as either periphery or noise.
Note that for any single value of eps, DBSCAN will tend to have a
shorter run time than OPTICS; however, for repeated runs at varying eps
values, a single run of OPTICS may require less cumulative runtime than
DBSCAN. It is also important to note that OPTICS’ output is close to
DBSCAN’s only if eps and max_eps are close.
Computational Complexity
Spatial indexing trees are used to avoid calculating the full distance
matrix, and allow for efficient memory usage on large sets of samples.
Different distance metrics can be supplied via the metric keyword.
For large datasets, similar (but not identical) results can be obtained via
HDBSCAN. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS,
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain \(n\) (as opposed
to \(n^2\)) memory scaling; however, tuning of the max_eps parameter
will likely need to be used to give a solution in a reasonable amount of
wall time.
References:
“OPTICS: ordering points to identify the clustering structure.”
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
2.3.10. BIRCH
The Birch algorithm builds a tree called the Clustering Feature Tree (CFT)
for the given data. The data is essentially lossy compressed to a set of
Clustering Feature nodes (CF Nodes). The CF Nodes have a number of
subclusters called Clustering Feature subclusters (CF Subclusters)
and these CF Subclusters located in the non-terminal CF Nodes
can have CF Nodes as children.
The CF Subclusters hold the necessary information for clustering which prevents
the need to hold the entire input data in memory. This information includes:
Number of samples in a subcluster.
Linear Sum - An n-dimensional vector holding the sum of all samples.
Squared Sum - Sum of the squared L2 norm of all samples.
Centroids - Linear sum / n_samples, stored to avoid recalculation.
Squared norm of the centroids.
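The quantities listed above are enough to recover a subcluster's centroid and radius, and to merge two subclusters, without revisiting the raw samples. The helper functions below (cf_summary, cf_merge, cf_centroid_radius) are hypothetical names written for illustration and are not part of the scikit-learn API.

```python
import numpy as np

def cf_summary(points):
    """Return (n_samples, linear_sum, squared_sum) for a set of points."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def cf_merge(cf_a, cf_b):
    """Merging two subclusters only requires adding their statistics."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def cf_centroid_radius(cf):
    """Centroid and radius (root mean squared distance to the centroid)."""
    n, linear_sum, squared_sum = cf
    centroid = linear_sum / n
    sq_radius = squared_sum / n - centroid @ centroid
    return centroid, float(np.sqrt(max(sq_radius, 0.0)))

a = cf_summary([[0.0, 0.0], [2.0, 0.0]])
b = cf_summary([[1.0, 3.0]])
merged = cf_merge(a, b)
centroid, radius = cf_centroid_radius(merged)
print(centroid, radius)
```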
The BIRCH algorithm has two parameters, the threshold and the branching factor.
The branching factor limits the number of subclusters in a node and the
threshold limits the distance between the entering sample and the existing
subclusters.
This algorithm can be viewed as an instance or data reduction method,
since it reduces the input data to a set of subclusters which are obtained directly
from the leaves of the CFT. This reduced data can be further processed by feeding
it into a global clusterer. This global clusterer can be set by n_clusters.
If n_clusters is set to None, the subclusters from the leaves are directly
read off, otherwise a global clustering step labels these subclusters into global
clusters (labels) and the samples are mapped to the global label of the nearest subcluster.
Algorithm description:
A new sample is inserted into the root of the CF Tree which is a CF Node.
It is then merged with the subcluster of the root that has the smallest
radius after merging, constrained by the threshold and branching factor conditions.
If the subcluster has any child node, then this is done repeatedly until it reaches
a leaf. After finding the nearest subcluster in the leaf, the properties of this
subcluster and the parent subclusters are recursively updated.
If the radius of the subcluster obtained by merging the new sample and the
nearest subcluster is greater than the square of the threshold and if the
number of subclusters is greater than the branching factor, then a space is temporarily
allocated to this new sample. The two farthest subclusters are taken and
the subclusters are divided into two groups on the basis of the distance
between these subclusters.
If this split node has a parent subcluster and there is room
for a new subcluster, then the parent is split into two. If there is no room,
then this node is again split into two and the process is continued
recursively, till it reaches the root.
BIRCH or MiniBatchKMeans?
BIRCH does not scale very well to high dimensional data. As a rule of thumb if
n_features is greater than twenty, it is generally better to use MiniBatchKMeans.
If the number of instances of data needs to be reduced, or if one wants a
large number of subclusters either as a preprocessing step or otherwise,
BIRCH is more useful than MiniBatchKMeans.
How to use partial_fit?
To avoid the computation of global clustering on every call of partial_fit,
the user is advised to:
Set n_clusters=None initially.
Train on all data with multiple calls to partial_fit.
Set n_clusters to the required value using
brc.set_params(n_clusters=n_clusters).
Finally, call partial_fit with no arguments, i.e. brc.partial_fit(),
which performs the global clustering.
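The steps above can be sketched as follows. The synthetic data, the threshold value and the number of chunks are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic data (illustrative only): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(5.0, 0.3, size=(100, 2)),
])

# Step 1: disable the global clustering step while streaming the data.
brc = Birch(threshold=0.5, n_clusters=None)

# Step 2: feed the data in chunks.
for chunk in np.array_split(X, 4):
    brc.partial_fit(chunk)
n_subclusters = len(brc.subcluster_centers_)

# Step 3: set the desired number of global clusters.
brc.set_params(n_clusters=2)

# Step 4: run only the global clustering step.
brc.partial_fit()

labels = brc.predict(X)
print(n_subclusters, sorted(set(labels.tolist())))
```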
References:
Tian Zhang, Raghu Ramakrishnan, Miron Livny
BIRCH: An efficient data clustering method for large databases.
https://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
Roberto Perdisci
JBirch - Java implementation of BIRCH clustering algorithm
https://code.google.com/archive/p/jbirch
2.3.11. Clustering performance evaluation
Evaluating the performance of a clustering algorithm is not as trivial as
counting the number of errors or the precision and recall of a supervised
classification algorithm. In particular, any evaluation metric should not
take the absolute values of the cluster labels into account but rather
whether this clustering defines separations of the data similar to some ground
truth set of classes, or satisfies some assumption such that members
belonging to the same class are more similar than members of different
classes according to some similarity metric.
2.3.11.1. Rand index
Given the knowledge of the ground truth class assignments
labels_true and our clustering algorithm assignments of the same
samples labels_pred, the (adjusted or unadjusted) Rand index
is a function that measures the similarity of the two assignments,
ignoring permutations:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.rand_score(labels_true, labels_pred)
0.66...
The Rand index does not guarantee a value close to 0.0 for a
random labelling. The adjusted Rand index corrects for chance and
will give such a baseline.
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
As with all clustering metrics, one can permute 0 and 1 in the predicted
labels, rename 2 to 3, and get the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.rand_score(labels_true, labels_pred)
0.66...
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
Furthermore, both rand_score and adjusted_rand_score are
symmetric: swapping the arguments does not change the scores. They can
thus be used as consensus measures:
>>> metrics.rand_score(labels_pred, labels_true)
0.66...
>>> metrics.adjusted_rand_score(labels_pred, labels_true)
0.24...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.rand_score(labels_true, labels_pred)
1.0
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
1.0
Poorly agreeing labels (e.g. independent labelings) have lower scores,
and for the adjusted Rand index the score will be negative or close to
zero. However, for the unadjusted Rand index the score, while lower,
will not necessarily be close to zero:
>>> labels_true = [0, 0, 0, 0, 0, 0, 1, 1]
>>> labels_pred = [0, 1, 2, 3, 4, 5, 5, 6]
>>> metrics.rand_score(labels_true, labels_pred)
0.39...
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
-0.07...
2.3.11.1.1. Advantages
Interpretability: The unadjusted Rand index is proportional
to the number of sample pairs whose labels are the same in both
labels_pred and labels_true, or are different in both.
Random (uniform) label assignments have an adjusted Rand index
score close to 0.0 for any value of n_clusters and
n_samples (which is not the case for the unadjusted Rand index
or the V-measure for instance).
Bounded range: Lower values indicate different labelings,
similar clusterings have a high (adjusted or unadjusted) Rand index,
1.0 is the perfect match score. The score range is [0, 1] for the
unadjusted Rand index and [-0.5, 1] for the adjusted Rand index.
No assumption is made on the cluster structure: The (adjusted or
unadjusted) Rand index can be used to compare all kinds of
clustering algorithms, and can be used to compare clustering
algorithms such as k-means, which assumes isotropic blob shapes, with
results of spectral clustering algorithms, which can find clusters
with “folded” shapes.
2.3.11.1.2. Drawbacks
Contrary to inertia, the (adjusted or unadjusted) Rand index
requires knowledge of the ground truth classes which is almost
never available in practice or requires manual assignment by human
annotators (as in the supervised learning setting).
However, the (adjusted or unadjusted) Rand index can also be useful in a
purely unsupervised setting as a building block for a Consensus
Index that can be used for clustering model selection.
The unadjusted Rand index is often close to 1.0 even if the
clusterings themselves differ significantly. This can be understood
when interpreting the Rand index as the accuracy of element pair
labeling resulting from the clusterings: in practice there often is
a majority of element pairs that are assigned the “different” pair
label under both the predicted and the ground truth clustering,
resulting in a high proportion of pair labels that agree, which
leads subsequently to a high score.
Examples:
Adjustment for chance in clustering performance evaluation:
Analysis of the impact of the dataset size on the value of
clustering measures for random assignments.
2.3.11.1.3. Mathematical formulation
If C is a ground truth class assignment and K the clustering, let us
define \(a\) and \(b\) as:
\(a\), the number of pairs of elements that are in the same set
in C and in the same set in K
\(b\), the number of pairs of elements that are in different sets
in C and in different sets in K
The unadjusted Rand index is then given by:
\[\text{RI} = \frac{a + b}{C_2^{n_{samples}}}\]
where \(C_2^{n_{samples}}\) is the total number of possible pairs
in the dataset. It does not matter if the calculation is performed on
ordered pairs or unordered pairs as long as the calculation is
performed consistently.
However, the Rand index does not guarantee that random label assignments
will get a value close to zero (esp. if the number of clusters is in
the same order of magnitude as the number of samples).
To counter this effect we can discount the expected RI \(E[\text{RI}]\) of
random labelings by defining the adjusted Rand index as follows:
\[\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}\]
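The pair-counting definition of the unadjusted Rand index can be checked by brute force on the small example used earlier in this section. The loop below is an illustrative sketch that enumerates unordered pairs and is not how rand_score is implemented.

```python
from itertools import combinations

from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

a = 0  # pairs in the same set in C and in the same set in K
b = 0  # pairs in different sets in C and in different sets in K
pairs = list(combinations(range(len(labels_true)), 2))
for i, j in pairs:
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true and same_pred:
        a += 1
    elif not same_true and not same_pred:
        b += 1

ri = (a + b) / len(pairs)
print(a, b, ri)  # a = 2, b = 8, RI = 10 / 15
print(metrics.rand_score(labels_true, labels_pred))
```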
References
Comparing Partitions
L. Hubert and P. Arabie, Journal of Classification 1985
Properties of the Hubert-Arabie adjusted Rand index
D. Steinley, Psychological Methods 2004
Wikipedia entry for the Rand index
Wikipedia entry for the adjusted Rand index
2.3.11.2. Mutual Information based scores
Given the knowledge of the ground truth class assignments labels_true and
our clustering algorithm assignments of the same samples labels_pred, the
Mutual Information is a function that measures the agreement of the two
assignments, ignoring permutations. Two different normalized versions of this
measure are available, Normalized Mutual Information (NMI) and Adjusted
Mutual Information (AMI). NMI is often used in the literature, while AMI was
proposed more recently and is normalized against chance:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get
the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
All of mutual_info_score, adjusted_mutual_info_score and
normalized_mutual_info_score are symmetric: swapping the arguments does
not change the score. Thus they can be used as a consensus measure:
>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)
0.22504...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
1.0
>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)
1.0
This is not true for mutual_info_score, which is therefore harder to judge:
>>> metrics.mutual_info_score(labels_true, labels_pred)
0.69...
Bad labelings (e.g. independent labelings) have non-positive scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
-0.10526...
2.3.11.2.1. Advantages
Random (uniform) label assignments have an AMI score close to 0.0
for any value of n_clusters and n_samples (which is not the
case for raw Mutual Information or the V-measure for instance).
Upper bound of 1: Values close to zero indicate two label
assignments that are largely independent, while values close to one
indicate significant agreement. Further, an AMI of exactly 1 indicates
that the two label assignments are equal (with or without permutation).
2.3.11.2.2. Drawbacks
Contrary to inertia, MI-based measures require knowledge
of the ground truth classes, which is almost never available in practice or
requires manual assignment by human annotators (as in the supervised learning
setting).
However, MI-based measures can also be useful in a purely unsupervised setting as a
building block for a Consensus Index that can be used for clustering
model selection.
NMI and MI are not adjusted against chance.
Examples:
Adjustment for chance in clustering performance evaluation: Analysis of
the impact of the dataset size on the value of clustering measures
for random assignments. This example also includes the Adjusted Rand
Index.
2.3.11.2.3. Mathematical formulation
Assume two label assignments (of the same N objects), \(U\) and \(V\).
Their entropy is the amount of uncertainty for a partition set, defined by:
\[H(U) = - \sum_{i=1}^{|U|}P(i)\log(P(i))\]
where \(P(i) = |U_i| / N\) is the probability that an object picked at
random from \(U\) falls into class \(U_i\). Likewise for \(V\):
\[H(V) = - \sum_{j=1}^{|V|}P'(j)\log(P'(j))\]
With \(P'(j) = |V_j| / N\). The mutual information (MI) between \(U\)
and \(V\) is calculated by:
\[\text{MI}(U, V) = \sum_{i=1}^{|U|}\sum_{j=1}^{|V|}P(i, j)\log\left(\frac{P(i,j)}{P(i)P'(j)}\right)\]
where \(P(i, j) = |U_i \cap V_j| / N\) is the probability that an object
picked at random falls into both classes \(U_i\) and \(V_j\).
It also can be expressed in set cardinality formulation:
\[\text{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N}\log\left(\frac{N|U_i \cap V_j|}{|U_i||V_j|}\right)\]
The normalized mutual information is defined as
\[\text{NMI}(U, V) = \frac{\text{MI}(U, V)}{\text{mean}(H(U), H(V))}\]
Neither the mutual information nor its normalized variant is
adjusted for chance: both tend to increase as the number of different labels
(clusters) increases, regardless of the actual amount of “mutual information”
between the label assignments.
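The MI and NMI formulas above can be evaluated directly from the contingency table of the two labelings. The sketch below uses the natural logarithm and the arithmetic mean of the entropies, and compares the result with the corresponding scikit-learn functions; the helper entropy is a name introduced here for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# counts[i, j] = |U_i ∩ V_j|
counts = contingency_matrix(labels_true, labels_pred).astype(float)
N = counts.sum()

p_ij = counts / N                        # P(i, j)
p_i = p_ij.sum(axis=1, keepdims=True)    # P(i)
p_j = p_ij.sum(axis=0, keepdims=True)    # P'(j)

# Empty cells contribute nothing to the sum, so they are skipped.
nz = p_ij > 0
mi = float((p_ij[nz] * np.log(p_ij[nz] / (p_i @ p_j)[nz])).sum())

def entropy(p):
    """Entropy of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

nmi = mi / np.mean([entropy(p_i.ravel()), entropy(p_j.ravel())])

print(mi, metrics.mutual_info_score(labels_true, labels_pred))
print(nmi, metrics.normalized_mutual_info_score(labels_true, labels_pred))
```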
The expected value for the mutual information can be calculated using the
following equation [VEB2009]. In this equation,
\(a_i = |U_i|\) (the number of elements in \(U_i\)) and
\(b_j = |V_j|\) (the number of elements in \(V_j\)).
\[E[\text{MI}(U,V)]=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \sum_{n_{ij}=(a_i+b_j-N)^+}^{\min(a_i, b_j)} \frac{n_{ij}}{N}\log \left( \frac{N \cdot n_{ij}}{a_i b_j}\right) \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!(N-a_i-b_j+n_{ij})!}\]
Using the expected value, the adjusted mutual information can then be
calculated using a similar form to that of the adjusted Rand index:
\[\text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\text{mean}(H(U), H(V)) - E[\text{MI}]}\]
For normalized mutual information and adjusted mutual information, the normalizing
value is typically some generalized mean of the entropies of each clustering.
Various generalized means exist, and no firm rules exist for preferring one over the
others. The decision is largely made on a field-by-field basis; for instance, in community
detection, the arithmetic mean is most common. Each
normalizing method provides “qualitatively similar behaviours” [YAT2016]. In our
implementation, this is controlled by the average_method parameter.
Vinh et al. (2010) named variants of NMI and AMI by their averaging method [VEB2010]. Their
‘sqrt’ and ‘sum’ averages are the geometric and arithmetic means; we use these
more broadly common names.
References
Strehl, Alexander, and Joydeep Ghosh (2002). “Cluster ensembles – a
knowledge reuse framework for combining multiple partitions”. Journal of
Machine Learning Research 3: 583–617.
doi:10.1162/153244303321897735.
Wikipedia entry for the (normalized) Mutual Information
Wikipedia entry for the Adjusted Mutual Information
[VEB2009]
Vinh, Epps, and Bailey, (2009). “Information theoretic measures
for clusterings comparison”. Proceedings of the 26th Annual International
Conference on Machine Learning - ICML ‘09.
doi:10.1145/1553374.1553511.
ISBN 9781605585161.
[VEB2010]
Vinh, Epps, and Bailey, (2010). “Information Theoretic Measures for
Clusterings Comparison: Variants, Properties, Normalization and
Correction for Chance”. JMLR
[YAT2016]
Yang, Algesheimer, and Tessone, (2016). “A comparative analysis of
community detection algorithms on artificial networks”. Scientific Reports 6: 30750.
doi:10.1038/srep30750.
2.3.11.3. Homogeneity, completeness and V-measure
Given the knowledge of the ground truth class assignments of the samples,
it is possible to define some intuitive metric using conditional entropy
analysis.
In particular Rosenberg and Hirschberg (2007) define the following two
desirable objectives for any cluster assignment:
homogeneity: each cluster contains only members of a single class.
completeness: all members of a given class are assigned to the same
cluster.
We can turn those concepts into the scores homogeneity_score and
completeness_score. Both are bounded below by 0.0 and above by
1.0 (higher is better):
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...
>>> metrics.completeness_score(labels_true, labels_pred)
0.42...
Their harmonic mean, called V-measure, is computed by
v_measure_score:
>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...
This function’s formula is as follows:
\[v = \frac{(1 + \beta) \times \text{homogeneity} \times \text{completeness}}{(\beta \times \text{homogeneity} + \text{completeness})}\]
beta defaults to a value of 1.0, but when using a value less than 1 for beta:
>>> metrics.v_measure_score(labels_true, labels_pred, beta=0.6)
0.54...
more weight will be attributed to homogeneity, and using a value greater than 1:
>>> metrics.v_measure_score(labels_true, labels_pred, beta=1.8)
0.48...
more weight will be attributed to completeness.
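As a quick check of the weighted formula, the sketch below recomputes the V-measure from the homogeneity and completeness scores for the two beta values used above. The helper name v_measure is introduced here for illustration only.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)

def v_measure(h, c, beta=1.0):
    """Weighted harmonic mean of homogeneity h and completeness c."""
    if h + c == 0.0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)

v_low = v_measure(h, c, beta=0.6)   # favors homogeneity
v_high = v_measure(h, c, beta=1.8)  # favors completeness

print(v_low, metrics.v_measure_score(labels_true, labels_pred, beta=0.6))
print(v_high, metrics.v_measure_score(labels_true, labels_pred, beta=1.8))
```

Because homogeneity is the larger of the two scores in this example, the smaller beta gives the higher V-measure.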
The V-measure is actually equivalent to the normalized mutual information (NMI)
discussed above, with the aggregation function being the arithmetic mean [B2011].
Homogeneity, completeness and V-measure can be computed at once using
homogeneity_completeness_v_measure as follows:
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
(0.66..., 0.42..., 0.51...)
The following clustering assignment is slightly better, since it is
homogeneous but not complete:
>>> labels_pred = [0, 0, 0, 1, 2, 2]
>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
(1.0, 0.68..., 0.81...)
Note
v_measure_score is symmetric: it can be used to evaluate
the agreement of two independent assignments on the same dataset.
This is not the case for completeness_score and
homogeneity_score: both are bound by the relationship:
homogeneity_score(a, b) == completeness_score(b, a)
2.3.11.3.1. Advantages
Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
Intuitive interpretation: a clustering with a bad V-measure can be
qualitatively analyzed in terms of homogeneity and completeness
to get a better feel for what ‘kind’ of mistakes the assignment makes.
No assumption is made on the cluster structure: can be used
to compare clustering algorithms such as k-means, which assumes isotropic
blob shapes, with results of spectral clustering algorithms, which can
find clusters with “folded” shapes.
2.3.11.3.2. Drawbacks
The previously introduced metrics are not normalized with regards to
random labeling: this means that depending on the number of samples,
clusters and ground truth classes, a completely random labeling will
not always yield the same values for homogeneity, completeness and
hence v-measure. In particular random labeling won’t yield zero
scores especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more
than a thousand and the number of clusters is less than 10. For
smaller sample sizes or larger number of clusters it is safer to use
an adjusted index such as the Adjusted Rand Index (ARI).
These metrics require knowledge of the ground truth classes, which is
almost never available in practice or requires manual assignment by
human annotators (as in the supervised learning setting).
Examples:
Adjustment for chance in clustering performance evaluation: Analysis of
the impact of the dataset size on the value of clustering measures
for random assignments.
2.3.11.3.3. Mathematical formulation
Homogeneity and completeness scores are formally given by:
\[h = 1 - \frac{H(C|K)}{H(C)}\]
\[c = 1 - \frac{H(K|C)}{H(K)}\]
where \(H(C|K)\) is the conditional entropy of the classes given
the cluster assignments and is given by:
\[H(C|K) = - \sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n}
\cdot \log\left(\frac{n_{c,k}}{n_k}\right)\]
and \(H(C)\) is the entropy of the classes and is given by:
\[H(C) = - \sum_{c=1}^{|C|} \frac{n_c}{n} \cdot \log\left(\frac{n_c}{n}\right)\]
with \(n\) the total number of samples, \(n_c\) and \(n_k\)
the number of samples respectively belonging to class \(c\) and
cluster \(k\), and finally \(n_{c,k}\) the number of samples
from class \(c\) assigned to cluster \(k\).
The conditional entropy of clusters given class \(H(K|C)\) and the
entropy of clusters \(H(K)\) are defined in a symmetric manner.
Rosenberg and Hirschberg further define V-measure as the harmonic
mean of homogeneity and completeness:
\[v = 2 \cdot \frac{h \cdot c}{h + c}\]
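The formulas above can be evaluated directly from the contingency counts. The following is a minimal sketch, not the library's implementation; the helper names `entropy` and `conditional_entropy` are hypothetical and introduced only for this example:

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# n_ck[c, k] = number of samples from class c assigned to cluster k
n_ck = contingency_matrix(labels_true, labels_pred).astype(float)
n_c = n_ck.sum(axis=1)  # class sizes
n_k = n_ck.sum(axis=0)  # cluster sizes


def entropy(counts):
    """Entropy of a labeling given the size of each group."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))


def conditional_entropy(joint, col_sums):
    """H(rows | columns) of a count table; empty cells contribute zero."""
    rows, cols = np.nonzero(joint)
    vals = joint[rows, cols]
    return -np.sum(vals / joint.sum() * np.log(vals / col_sums[cols]))


h = 1 - conditional_entropy(n_ck, n_k) / entropy(n_c)    # homogeneity
c = 1 - conditional_entropy(n_ck.T, n_c) / entropy(n_k)  # completeness
v = 2 * h * c / (h + c)                                  # V-measure

print(h, c, v)
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
```

The sketch does not handle degenerate cases such as a single class or a single cluster, where an entropy in the denominator is zero.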
References
Andrew Rosenberg and Julia Hirschberg, 2007. “V-Measure: A conditional
entropy-based external cluster evaluation measure”.
[B2011] Hila Becker. “Identification and Characterization of Events in
Social Media”, PhD Thesis.
2.3.11.4. Fowlkes-Mallows scores¶
The Fowlkes-Mallows index (sklearn.metrics.fowlkes_mallows_score) can be
used when the ground truth class assignments of the samples are known. The
Fowlkes-Mallows score FMI is defined as the geometric mean of the
pairwise precision and recall:
\[\text{FMI} = \frac{\text{TP}}{\sqrt{(\text{TP} + \text{FP}) (\text{TP} + \text{FN})}}\]
where TP is the number of True Positives (i.e. the number of pairs
of points that belong to the same cluster in both the true labels and the
predicted labels), FP is the number of False Positives (i.e. the number
of pairs of points that belong to the same cluster in the predicted labels
but not in the true labels) and FN is the number of False Negatives (i.e. the
number of pairs of points that belong to the same cluster in the true
labels but not in the predicted labels).
The score ranges from 0 to 1. A high value indicates a good similarity
between two clusterings.
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get
the same score:
>>> labels_pred = [1, 1, 0, 0, 3, 3]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...
Perfect labeling is scored 1.0:
>>> labels_pred = labels_true[:]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
1.0
Bad labelings (e.g. independent labelings) have zero scores:
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.0
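The definition in terms of TP, FP and FN can be checked against the library function. The sketch below takes the pair counts from sklearn.metrics.cluster.pair_confusion_matrix, assuming its layout is [[TN, FP], [FN, TP]] with positives being pairs clustered together; that function counts ordered pairs, so every count is doubled, which cancels in the ratio:

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import pair_confusion_matrix

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Rows: same cluster in labels_true (no/yes);
# columns: same cluster in labels_pred (no/yes).
(tn, fp), (fn, tp) = pair_confusion_matrix(labels_true, labels_pred)

# Geometric mean of pairwise precision tp / (tp + fp)
# and pairwise recall tp / (tp + fn)
fmi = tp / np.sqrt((tp + fp) * (tp + fn))

print(fmi)
print(metrics.fowlkes_mallows_score(labels_true, labels_pred))
```

Both printed values should agree with the 0.47140... shown in the first example above.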
2.3.11.4.1. Advantages¶
Random (uniform) label assignments have an FMI score close to 0.0
for any value of n_clusters and n_samples (which is not the
case for raw Mutual Information or the V-measure for instance).
Upper-bounded at 1: Values close to zero indicate two label
assignments that are largely independent, while values close to one
indicate significant agreement. Further, values of exactly 0 indicate
purely independent label assignments and an FMI of exactly 1 indicates
that the two label assignments are equal (with or without permutation).
No assumption is made on the cluster structure: can be used
to compare clustering algorithms such as k-means which assumes isotropic
blob shapes with results of spectral clustering algorithms which can
find clusters with “folded” shapes.
2.3.11.4.2. Drawbacks¶
Contrary to inertia, FMI-based measures require knowledge
of the ground truth classes, which is almost never available in practice or
requires manual assignment by human annotators (as in the supervised learning
setting).
References
E. B. Fowlkes and C. L. Mallows, 1983. “A method for comparing two
hierarchical clusterings”. Journal of the American Statistical Association.
https://www.tandfonline.com/doi/abs/10.1080/01621459.1983.10478008
Wikipedia entry for the Fowlkes-Mallows Index
2.3.11.5. Silhouette Coefficient¶
If the ground truth labels are not known, evaluation must be performed using
the model itself. The Silhouette Coefficient
(sklearn.metrics.silhouette_score)
is an example of such an evaluation, where a
higher Silhouette Coefficient score relates to a model with better defined
clusters. The Silhouette Coefficient is defined for each sample and is composed
of two scores:
a: The mean distance between a sample and all other points in the same
cluster.
b: The mean distance between a sample and all other points in the next
nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
\[s = \frac{b - a}{\max(a, b)}\]
The Silhouette Coefficient for a set of samples is given as the mean of the
Silhouette Coefficient for each sample.
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
In normal usage, the Silhouette Coefficient is applied to the results of a
cluster analysis.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
0.55...
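To make the roles of a and b concrete, the sketch below recomputes the coefficient of a single sample by hand and compares it with sklearn.metrics.silhouette_samples. The sample index `i = 0` is an arbitrary choice, and `n_init=10` is an assumption added here (it is not part of the example above) to keep the KMeans behaviour the same across library versions:

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

X, _ = datasets.load_iris(return_X_y=True)
# n_init=10 is an assumption of this sketch, see the note above
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X).labels_

D = pairwise_distances(X, metric="euclidean")
i = 0  # sample to inspect (arbitrary)
own = labels == labels[i]

# a: mean distance to the other members of the sample's own cluster
# (the zero distance of the sample to itself is excluded from the count)
a = D[i, own].sum() / (own.sum() - 1)

# b: smallest mean distance to the members of any other cluster
b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])

s_i = (b - a) / max(a, b)

print(s_i)
print(metrics.silhouette_samples(X, labels, metric="euclidean")[i])
```

The hand computation assumes that the sample's cluster has more than one member; averaging the per-sample values over the dataset gives the value returned by silhouette_score.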
References
Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the
Interpretation and Validation of Cluster Analysis”.
Computational and Applied Mathematics 20: 53–65.
2.3.11.5.1. Advantages¶
The score is bounded between -1 for incorrect clustering and +1 for highly
dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates
to a standard concept of a cluster.
2.3.11.5.2. Drawbacks¶
The Silhouette Coefficient is generally higher for convex clusters than other
concepts of clusters, such as density based clusters like those obtained
through DBSCAN.
Examples:
Selecting the number of clusters with silhouette analysis on KMeans clustering: In this example
the silhouette analysis is used to choose an optimal value for n_clusters.
2.3.11.6. Calinski-Harabasz Index¶
If the ground truth labels are not known, the Calinski-Harabasz index
(sklearn.metrics.calinski_harabasz_score) - also known as the Variance
Ratio Criterion - can be used to evaluate the model, where a higher
Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the sum of between-clusters dispersion and of
within-cluster dispersion for all clusters (where dispersion is defined as the
sum of distances squared):
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
In normal usage, the Calinski-Harabasz index is applied to the results of a
cluster analysis:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
561.59...
2.3.11.6.1. Advantages¶
The score is higher when clusters are dense and well separated, which relates
to a standard concept of a cluster.
The score is fast to compute.
2.3.11.6.2. Drawbacks¶
The Calinski-Harabasz index is generally higher for convex clusters than other
concepts of clusters, such as density based clusters like those obtained
through DBSCAN.
2.3.11.6.3. Mathematical formulation¶
For a set of data \(E\) of size \(n_E\) which has been clustered into
\(k\) clusters, the Calinski-Harabasz score \(s\) is defined as the
ratio of the between-clusters dispersion mean and the within-cluster dispersion:
\[s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}\]
where \(\mathrm{tr}(B_k)\) is the trace of the between-group dispersion matrix
and \(\mathrm{tr}(W_k)\) is the trace of the within-cluster dispersion
matrix defined by:
\[W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T\]
\[B_k = \sum_{q=1}^k n_q (c_q - c_E) (c_q - c_E)^T\]
with \(C_q\) the set of points in cluster \(q\), \(c_q\) the center
of cluster \(q\), \(c_E\) the center of \(E\), and \(n_q\) the
number of points in cluster \(q\).
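The definition above can be reproduced in a few lines. The following is a sketch under the notation of this section, not the library's implementation; it uses the fact that the trace of an outer product \(v v^T\) equals the squared norm of \(v\), and `n_init=10` is an assumption added to keep the KMeans behaviour the same across library versions:

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X, _ = datasets.load_iris(return_X_y=True)
# n_init=10 is an assumption of this sketch, see the note above
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X).labels_

n_E = X.shape[0]
cluster_ids = np.unique(labels)
k = len(cluster_ids)
c_E = X.mean(axis=0)  # center of the whole dataset

tr_B = 0.0  # trace of the between-cluster dispersion matrix
tr_W = 0.0  # trace of the within-cluster dispersion matrix
for q in cluster_ids:
    X_q = X[labels == q]
    c_q = X_q.mean(axis=0)  # center of cluster q
    tr_B += len(X_q) * np.sum((c_q - c_E) ** 2)
    tr_W += np.sum((X_q - c_q) ** 2)

s = (tr_B / tr_W) * (n_E - k) / (k - 1)

print(s)
print(metrics.calinski_harabasz_score(X, labels))
```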
References
Caliński, T., & Harabasz, J. (1974).
“A Dendrite Method for Cluster Analysis”.
Communications in Statistics-theory and Methods 3: 1-27.
2.3.11.7. Davies-Bouldin Index¶
If the ground truth labels are not known, the Davies-Bouldin index
(sklearn.metrics.davies_bouldin_score) can be used to evaluate the
model, where a lower Davies-Bouldin index relates to a model with better
separation between the clusters.
This index signifies the average ‘similarity’ between clusters, where the
similarity is a measure that compares the distance between clusters with the
size of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better
partition.
In normal usage, the Davies-Bouldin index is applied to the results of a
cluster analysis as follows:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.666...
2.3.11.7.1. Advantages¶
The computation of Davies-Bouldin is simpler than that of Silhouette scores.
The index is solely based on quantities and features inherent to the dataset
as its computation only uses point-wise distances.
2.3.11.7.2. Drawbacks¶
The Davies-Bouldin index generally rates convex clusters better (i.e. gives
them lower values) than other concepts of clusters, such as density based
clusters like those obtained from DBSCAN.
The usage of centroid distance limits the distance metric to Euclidean space.
2.3.11.7.3. Mathematical formulation¶
The index is defined as the average similarity between each cluster \(C_i\)
for \(i=1, ..., k\) and its most similar one \(C_j\). In the context of
this index, similarity is defined as a measure \(R_{ij}\) that trades off:
\(s_i\), the average distance between each point of cluster \(i\) and
the centroid of that cluster, also known as the cluster diameter.
\(d_{ij}\), the distance between cluster centroids \(i\) and \(j\).
A simple choice to construct \(R_{ij}\) so that it is nonnegative and
symmetric is:
\[R_{ij} = \frac{s_i + s_j}{d_{ij}}\]
Then the Davies-Bouldin index is defined as:
\[DB = \frac{1}{k} \sum_{i=1}^k \max_{j \neq i} R_{ij}\]
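The quantities \(s_i\), \(d_{ij}\) and \(R_{ij}\) can be computed explicitly, as in the sketch below. It is an illustration of the formulas in this section rather than the library's implementation, and `n_init=10` is an assumption added to keep the KMeans behaviour the same across library versions:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = datasets.load_iris(return_X_y=True)
# n_init=10 is an assumption of this sketch, see the note above
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X).labels_

cluster_ids = np.unique(labels)
k = len(cluster_ids)
centroids = np.array([X[labels == q].mean(axis=0) for q in cluster_ids])

# s[i]: average distance of the points of cluster i to its centroid
s = np.array([
    np.linalg.norm(X[labels == q] - centroids[idx], axis=1).mean()
    for idx, q in enumerate(cluster_ids)
])

# d[i, j]: distance between the centroids of clusters i and j
d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

# R[i, j] = (s_i + s_j) / d_ij for i != j; the diagonal is excluded
R = np.full((k, k), -np.inf)
off_diag = ~np.eye(k, dtype=bool)
R[off_diag] = (s[:, None] + s[None, :])[off_diag] / d[off_diag]

# Average, over clusters, of the similarity to the most similar other cluster
db = R.max(axis=1).mean()

print(db)
print(davies_bouldin_score(X, labels))
```

The sketch assumes that no two centroids coincide, so that every \(d_{ij}\) with \(i \neq j\) is strictly positive.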
References
Davies, David L.; Bouldin, Donald W. (1979).
“A Cluster Separation Measure”
IEEE Transactions on Pattern Analysis and Machine Intelligence.
PAMI-1 (2): 224-227.
Halkidi, Maria; Batistakis, Yannis; Vazirgiannis, Michalis (2001).
“On Clustering Validation Techniques”
Journal of Intelligent Information Systems, 17(2-3), 107-145.
Wikipedia entry for Davies-Bouldin index.
2.3.11.8. Contingency Matrix¶
Contingency matrix (sklearn.metrics.cluster.contingency_matrix)
reports the intersection cardinality for every true/predicted cluster pair.
The contingency matrix provides sufficient statistics for all clustering
metrics where the samples are independent and identically distributed and
one doesn’t need to account for some instances not being clustered.
Here is an example:
>>> from sklearn.metrics.cluster import contingency_matrix
>>> x = ["a", "a", "a", "b", "b", "b"]
>>> y = [0, 0, 1, 1, 2, 2]
>>> contingency_matrix(x, y)
array([[2, 1, 0],
[0, 1, 2]])
The first row of the output array indicates that there are three samples whose
true cluster is “a”. Of them, two are in predicted cluster 0, one is in 1,
and none is in 2. The second row indicates that there are three samples
whose true cluster is “b”. Of them, none is in predicted cluster 0, one is in
1 and two are in 2.
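The counting described above can be spelled out with NumPy alone. The sketch below is one possible way to build the same table (it is not how the library computes it): each label is mapped to a row or column index, and the cell of every sample is incremented by one:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

x = ["a", "a", "a", "b", "b", "b"]
y = [0, 0, 1, 1, 2, 2]

# Map each label to the index of its row (true cluster) or column
# (predicted cluster)
classes, class_idx = np.unique(x, return_inverse=True)
clusters, cluster_idx = np.unique(y, return_inverse=True)

# Add one to cell (row, column) for every sample; np.add.at accumulates
# correctly when the same cell is hit several times
manual = np.zeros((len(classes), len(clusters)), dtype=int)
np.add.at(manual, (class_idx, cluster_idx), 1)

print(manual)
print(contingency_matrix(x, y))
```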
A confusion matrix for classification is a square
contingency matrix where the order of rows and columns corresponds to a list
of classes.
2.3.11.8.1. Advantages¶
Allows one to examine the spread of each true cluster across predicted
clusters and vice versa.
The contingency table calculated is typically utilized in the calculation
of a similarity statistic (like the others listed in this document) between
the two clusterings.
2.3.11.8.2. Drawbacks¶
Contingency matrix is easy to interpret for a small number of clusters, but
becomes very hard to interpret for a large number of clusters.
It doesn’t give a single metric to use as an objective for clustering
optimisation.
References
Wikipedia entry for contingency matrix
2.3.11.9. Pair Confusion Matrix¶
The pair confusion matrix
(sklearn.metrics.cluster.pair_confusion_matrix) is a 2x2
similarity matrix
\[\begin{split}C = \left[\begin{matrix}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{matrix}\right]\end{split}\]
between two clusterings computed by considering all pairs of samples and
counting pairs that are assigned into the same or into different clusters
under the true and predicted clusterings.
It has the following entries:
\(C_{00}\) : number of pairs with both clusterings having the samples
not clustered together
\(C_{10}\) : number of pairs with the true label clustering having the
samples clustered together but the other clustering not having the samples
clustered together
\(C_{01}\) : number of pairs with the true label clustering not having
the samples clustered together but the other clustering having the samples
clustered together
\(C_{11}\) : number of pairs with both clusterings having the samples
clustered together
Considering a pair of samples that is clustered together a positive pair,
then as in binary classification the count of true negatives is
\(C_{00}\), false negatives is \(C_{10}\), true positives is
\(C_{11}\) and false positives is \(C_{01}\).
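The four entries can be obtained by brute force, which makes the definitions above explicit. The sketch below enumerates ordered pairs of distinct samples, so each unordered pair is counted twice, which matches the counts returned by the library function:

```python
from itertools import permutations

import numpy as np
from sklearn.metrics.cluster import pair_confusion_matrix

labels_true = [0, 0, 1, 2]
labels_pred = [0, 0, 1, 1]

C = np.zeros((2, 2), dtype=int)
# Ordered pairs (i, j) with i != j
for i, j in permutations(range(len(labels_true)), 2):
    same_true = int(labels_true[i] == labels_true[j])  # row index
    same_pred = int(labels_pred[i] == labels_pred[j])  # column index
    C[same_true, same_pred] += 1

print(C)
print(pair_confusion_matrix(labels_true, labels_pred))
```

The enumeration is quadratic in the number of samples, so it is only meant for checking small examples.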
Perfectly matching labelings have all non-zero entries on the
diagonal regardless of actual label values:
>>> from sklearn.metrics.cluster import pair_confusion_matrix
>>> pair_confusion_matrix([0, 0, 1, 1], [0, 0, 1, 1])
array([[8, 0],
[0, 4]])
>>> pair_confusion_matrix([0, 0, 1, 1], [1, 1, 0, 0])
array([[8, 0],
[0, 4]])
Labelings that assign all class members to the same clusters
are complete but may not always be pure, hence penalized, and
have some off-diagonal non-zero entries:
>>> pair_confusion_matrix([0, 0, 1, 2], [0, 0, 1, 1])
array([[8, 2],
[0, 2]])
The matrix is not symmetric:
>>> pair_confusion_matrix([0, 0, 1, 1], [0, 0, 1, 2])
array([[8, 0],
[2, 2]])
If class members are completely split across different clusters, the
assignment is totally incomplete, hence the matrix has all zero
diagonal entries:
>>> pair_confusion_matrix([0, 0, 0, 0], [0, 1, 2, 3])
array([[ 0, 0],
[12, 0]])
References
“Comparing Partitions”
L. Hubert and P. Arabie, Journal of Classification 1985
© 2007 - 2024, scikit-learn developers (BSD License).
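sklearn Chinese documentation
Introduction to sklearn
scikit-learn is a machine learning toolkit for the Python language
Simple and efficient tools for data mining and data analysis
Reusable by everyone in a variety of contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license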
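Contents
Installing scikit-learn
User guide
1. Supervised learning
1.1. Generalized linear models
1.2. Linear and quadratic discriminant analysis
1.3. Kernel ridge regression
1.4. Support vector machines
1.5. Stochastic gradient descent
1.6. Nearest neighbors
1.7. Gaussian processes
1.8. Cross decomposition
1.9. Naive Bayes
1.10. Decision trees
1.11. Ensemble methods
1.12. Multiclass and multilabel algorithms
1.13. Feature selection
1.14. Semi-supervised learning
1.15. Isotonic regression
1.16. Probability calibration
1.17. Neural network models (supervised)
2. Unsupervised learning
2.1. Gaussian mixture models
2.2. Manifold learning
2.3. Clustering
2.4. Biclustering
2.5. Decomposing signals in components (matrix factorization problems)
2.6. Covariance estimation
2.7. Novelty and outlier detection
2.8. Density estimation
2.9. Neural network models (unsupervised)
3. Model selection and evaluation
3.1. Cross-validation: evaluating estimator performance
3.2. Tuning the hyper-parameters of an estimator
3.3. Model evaluation: quantifying the quality of predictions
3.4. Model persistence
3.5. Validation curves: plotting scores to evaluate models
4. Inspection
4.1. Partial dependence plots
5. Dataset transformations
5.1. Pipeline and FeatureUnion: combining estimators
5.2. Feature extraction
5.3 Preprocessing data
5.4 Imputation of missing values
5.5. Unsupervised dimensionality reduction
5.6. Random projection
5.7. Kernel approximation
5.8. Pairwise metrics, affinities and kernels
5.9. Transforming the prediction target (y)
6. Dataset loading utilities
6.1. General dataset API
6.2. Toy datasets
6.3 Real world datasets
6.4. Sample generators
6.5. Loading other datasets
7. Computing with scikit-learn
7.1. Strategies to scale computationally: bigger data
7.2. Computational performance
7.3. Parallelism, resource management, and configuration
Tutorials
An introduction to machine learning with scikit-learn
A tutorial on statistical learning for scientific data processing
Statistical learning: the setting and the estimator object in scikit-learn
Supervised learning: predicting an output variable from high-dimensional observations
Model selection: choosing estimators and their parameters
Unsupervised learning: seeking representations of the data
Putting it all together
Finding help
Working with text data
Choosing the right estimator (estimator/)
External resources, videos and talks
API reference
FAQ
Timeline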
Authors
sklearn-doc-zh: https://github.com/apachecn/sklearn-doc-zh
Compiled by
http://scikitlearn.com.cn/
sklearn PythonOK License: CC BY-NC-SA 4.0
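Contribution guide
The project is currently in the proofreading stage. Please read the contribution guide and claim a task in the overall progress tracker.
Please be bold in translating and in improving the translation. Although we strive for excellence, we do not require you to be perfect, so do not worry about making translation mistakes: in most cases our server keeps a record of every translation, so a slip on your part cannot cause irreparable damage. (Adapted from Wikipedia)
Project leads
Format: GitHub + QQ
@mahaoyang: 992635910
@loopyme: 3322728009
飞龙: 562826179
片刻: 529815144
-- Requirements for project leads: (everyone is welcome to contribute to the Chinese version of sklearn)
Passionate about open source and fond of showing off
Long-term sklearn user (at least 0.5 years) + at least 3 submitted Pull Requests
Has time to promptly fix page bugs and respond to user issues
Trial period: 2 months
Contact: 片刻 529815144
Project license
The license of each individual project prevails.
Projects under the ApacheCN account that have no license are all treated as CC BY-NC-SA 4.0.
Suggestions and feedback
Open an issue on our apachecn/pytorch-doc-zh GitHub repository.
Send an email to: apachecn@163.com.
Contact the group owner or an administrator in our QQ group (search for: 交流方式).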