scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation

scikit-learn

Machine Learning in Python

Simple and efficient tools for predictive data analysis

Accessible to everybody, and reusable in various contexts

Built on NumPy, SciPy, and matplotlib

Open source, commercially usable - BSD license

Classification

Identifying which category an object belongs to.

Applications: Spam detection, image recognition.

Algorithms: Gradient boosting, nearest neighbors, random forest, logistic regression, and more...

Examples

Regression

Predicting a continuous-valued attribute associated with an object.

Applications: Drug response, Stock prices.

Algorithms: Gradient boosting, nearest neighbors, random forest, ridge, and more...

Examples

Clustering

Automatic grouping of similar objects into sets.

Applications: Customer segmentation, Grouping experiment outcomes

Algorithms: k-Means, HDBSCAN, hierarchical clustering, and more...

Examples

Dimensionality reduction

Reducing the number of random variables to consider.

Applications: Visualization, Increased efficiency

Algorithms: PCA, feature selection, non-negative matrix factorization, and more...

Examples

Model selection

Comparing, validating and choosing parameters and models.

Applications: Improved accuracy via parameter tuning

Algorithms: grid search, cross validation, metrics, and more...

Examples

Preprocessing

Feature extraction and normalization.

Applications: Transforming input data such as text for use with machine learning algorithms.

Algorithms: preprocessing, feature extraction, and more...

Examples

News

On-going development: scikit-learn 1.5 (Changelog)

February 2024. scikit-learn 1.4.1.post1 is available for download (Changelog).

January 2024. scikit-learn 1.4.0 is available for download (Changelog).

October 2023. scikit-learn 1.3.2 is available for download (Changelog).

September 2023. scikit-learn 1.3.1 is available for download (Changelog).

June 2023. scikit-learn 1.3.0 is available for download (Changelog).

All releases: What's new (Changelog)

Community

About us: See authors and contributing

More Machine Learning: Find related projects

Questions? See FAQ and stackoverflow

Subscribe to the mailing list

Gitter: gitter.im/scikit-learn

Blog: blog.scikit-learn.org

Logos & Branding: logos and branding

Calendar: calendar

Twitter: @scikit_learn

LinkedIn: linkedin/scikit-learn

YouTube: youtube.com/scikit-learn

Facebook: @scikitlearnofficial

Instagram: @scikitlearnofficial

TikTok: @scikit.learn

Communication on all channels should respect PSF's code of conduct.

Help us, donate!

Cite us!

Who uses scikit-learn?

"We use scikit-learn to support leading-edge basic research [...]"

"I think it's the most well-designed ML package I've seen so far."

"scikit-learn's ease-of-use, performance and overall variety of algorithms implemented has proved invaluable [...]."

"The great benefit of scikit-learn is its fast learning curve [...]"

"It allows us to do AWesome stuff we would not otherwise accomplish"

"scikit-learn makes doing advanced analysis in Python accessible to anyone."

More testimonials

scikit-learn development and maintenance are financially supported by

User guide: contents — scikit-learn 1.4.1 documentation

Please cite us if you use the software.

User Guide

1. Supervised learning

1.1. Linear Models

1.1.1. Ordinary Least Squares

1.1.2. Ridge regression and classification

1.1.3. Lasso

1.1.4. Multi-task Lasso

1.1.5. Elastic-Net

1.1.6. Multi-task Elastic-Net

1.1.7. Least Angle Regression

1.1.8. LARS Lasso

1.1.9. Orthogonal Matching Pursuit (OMP)

1.1.10. Bayesian Regression

1.1.11. Logistic regression

1.1.12. Generalized Linear Models

1.1.13. Stochastic Gradient Descent - SGD

1.1.14. Perceptron

1.1.15. Passive Aggressive Algorithms

1.1.16. Robustness regression: outliers and modeling errors

1.1.17. Quantile Regression

1.1.18. Polynomial regression: extending linear models with basis functions

1.2. Linear and Quadratic Discriminant Analysis

1.2.1. Dimensionality reduction using Linear Discriminant Analysis

1.2.2. Mathematical formulation of the LDA and QDA classifiers

1.2.3. Mathematical formulation of LDA dimensionality reduction

1.2.4. Shrinkage and Covariance Estimator

1.2.5. Estimation algorithms

1.3. Kernel ridge regression

1.4. Support Vector Machines

1.4.1. Classification

1.4.2. Regression

1.4.3. Density estimation, novelty detection

1.4.4. Complexity

1.4.5. Tips on Practical Use

1.4.6. Kernel functions

1.4.7. Mathematical formulation

1.4.8. Implementation details

1.5. Stochastic Gradient Descent

1.5.1. Classification

1.5.2. Regression

1.5.3. Online One-Class SVM

1.5.4. Stochastic Gradient Descent for sparse data

1.5.5. Complexity

1.5.6. Stopping criterion

1.5.7. Tips on Practical Use

1.5.8. Mathematical formulation

1.5.9. Implementation details

1.6. Nearest Neighbors

1.6.1. Unsupervised Nearest Neighbors

1.6.2. Nearest Neighbors Classification

1.6.3. Nearest Neighbors Regression

1.6.4. Nearest Neighbor Algorithms

1.6.5. Nearest Centroid Classifier

1.6.6. Nearest Neighbors Transformer

1.6.7. Neighborhood Components Analysis

1.7. Gaussian Processes

1.7.1. Gaussian Process Regression (GPR)

1.7.2. Gaussian Process Classification (GPC)

1.7.3. GPC examples

1.7.4. Kernels for Gaussian Processes

1.8. Cross decomposition

1.8.1. PLSCanonical

1.8.2. PLSSVD

1.8.3. PLSRegression

1.8.4. Canonical Correlation Analysis

1.9. Naive Bayes

1.9.1. Gaussian Naive Bayes

1.9.2. Multinomial Naive Bayes

1.9.3. Complement Naive Bayes

1.9.4. Bernoulli Naive Bayes

1.9.5. Categorical Naive Bayes

1.9.6. Out-of-core naive Bayes model fitting

1.10. Decision Trees

1.10.1. Classification

1.10.2. Regression

1.10.3. Multi-output problems

1.10.4. Complexity

1.10.5. Tips on practical use

1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART

1.10.7. Mathematical formulation

1.10.8. Missing Values Support

1.10.9. Minimal Cost-Complexity Pruning

1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking

1.11.1. Gradient-boosted trees

1.11.2. Random forests and other randomized tree ensembles

1.11.3. Bagging meta-estimator

1.11.4. Voting Classifier

1.11.5. Voting Regressor

1.11.6. Stacked generalization

1.11.7. AdaBoost

1.12. Multiclass and multioutput algorithms

1.12.1. Multiclass classification

1.12.2. Multilabel classification

1.12.3. Multiclass-multioutput classification

1.12.4. Multioutput regression

1.13. Feature selection

1.13.1. Removing features with low variance

1.13.2. Univariate feature selection

1.13.3. Recursive feature elimination

1.13.4. Feature selection using SelectFromModel

1.13.5. Sequential Feature Selection

1.13.6. Feature selection as part of a pipeline

1.14. Semi-supervised learning

1.14.1. Self Training

1.14.2. Label Propagation

1.15. Isotonic regression

1.16. Probability calibration

1.16.1. Calibration curves

1.16.2. Calibrating a classifier

1.16.3. Usage

1.17. Neural network models (supervised)

1.17.1. Multi-layer Perceptron

1.17.2. Classification

1.17.3. Regression

1.17.4. Regularization

1.17.5. Algorithms

1.17.6. Complexity

1.17.7. Mathematical formulation

1.17.8. Tips on Practical Use

1.17.9. More control with warm_start

2. Unsupervised learning

2.1. Gaussian mixture models

2.1.1. Gaussian Mixture

2.1.2. Variational Bayesian Gaussian Mixture

2.2. Manifold learning

2.2.1. Introduction

2.2.2. Isomap

2.2.3. Locally Linear Embedding

2.2.4. Modified Locally Linear Embedding

2.2.5. Hessian Eigenmapping

2.2.6. Spectral Embedding

2.2.7. Local Tangent Space Alignment

2.2.8. Multi-dimensional Scaling (MDS)

2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)

2.2.10. Tips on practical use

2.3. Clustering

2.3.1. Overview of clustering methods

2.3.2. K-means

2.3.3. Affinity Propagation

2.3.4. Mean Shift

2.3.5. Spectral clustering

2.3.6. Hierarchical clustering

2.3.7. DBSCAN

2.3.8. HDBSCAN

2.3.9. OPTICS

2.3.10. BIRCH

2.3.11. Clustering performance evaluation

2.4. Biclustering

2.4.1. Spectral Co-Clustering

2.4.2. Spectral Biclustering

2.4.3. Biclustering evaluation

2.5. Decomposing signals in components (matrix factorization problems)

2.5.1. Principal component analysis (PCA)

2.5.2. Kernel Principal Component Analysis (kPCA)

2.5.3. Truncated singular value decomposition and latent semantic analysis

2.5.4. Dictionary Learning

2.5.5. Factor Analysis

2.5.6. Independent component analysis (ICA)

2.5.7. Non-negative matrix factorization (NMF or NNMF)

2.5.8. Latent Dirichlet Allocation (LDA)

2.6. Covariance estimation

2.6.1. Empirical covariance

2.6.2. Shrunk Covariance

2.6.3. Sparse inverse covariance

2.6.4. Robust Covariance Estimation

2.7. Novelty and Outlier Detection

2.7.1. Overview of outlier detection methods

2.7.2. Novelty Detection

2.7.3. Outlier Detection

2.7.4. Novelty detection with Local Outlier Factor

2.8. Density Estimation

2.8.1. Density Estimation: Histograms

2.8.2. Kernel Density Estimation

2.9. Neural network models (unsupervised)

2.9.1. Restricted Boltzmann machines

3. Model selection and evaluation

3.1. Cross-validation: evaluating estimator performance

3.1.1. Computing cross-validated metrics

3.1.2. Cross validation iterators

3.1.3. A note on shuffling

3.1.4. Cross validation and model selection

3.1.5. Permutation test score

3.2. Tuning the hyper-parameters of an estimator

3.2.1. Exhaustive Grid Search

3.2.2. Randomized Parameter Optimization

3.2.3. Searching for optimal parameters with successive halving

3.2.4. Tips for parameter search

3.2.5. Alternatives to brute force parameter search

3.3. Metrics and scoring: quantifying the quality of predictions

3.3.1. The scoring parameter: defining model evaluation rules

3.3.2. Classification metrics

3.3.3. Multilabel ranking metrics

3.3.4. Regression metrics

3.3.5. Clustering metrics

3.3.6. Dummy estimators

3.4. Validation curves: plotting scores to evaluate models

3.4.1. Validation curve

3.4.2. Learning curve

4. Inspection

4.1. Partial Dependence and Individual Conditional Expectation plots

4.1.1. Partial dependence plots

4.1.2. Individual conditional expectation (ICE) plot

4.1.3. Mathematical Definition

4.1.4. Computation methods

4.2. Permutation feature importance

4.2.1. Outline of the permutation importance algorithm

4.2.2. Relation to impurity-based importance in trees

4.2.3. Misleading values on strongly correlated features

5. Visualizations

5.1. Available Plotting Utilities

5.1.1. Display Objects

6. Dataset transformations

6.1. Pipelines and composite estimators

6.1.1. Pipeline: chaining estimators

6.1.2. Transforming target in regression

6.1.3. FeatureUnion: composite feature spaces

6.1.4. ColumnTransformer for heterogeneous data

6.1.5. Visualizing Composite Estimators

6.2. Feature extraction

6.2.1. Loading features from dicts

6.2.2. Feature hashing

6.2.3. Text feature extraction

6.2.4. Image feature extraction

6.3. Preprocessing data

6.3.1. Standardization, or mean removal and variance scaling

6.3.2. Non-linear transformation

6.3.3. Normalization

6.3.4. Encoding categorical features

6.3.5. Discretization

6.3.6. Imputation of missing values

6.3.7. Generating polynomial features

6.3.8. Custom transformers

6.4. Imputation of missing values

6.4.1. Univariate vs. Multivariate Imputation

6.4.2. Univariate feature imputation

6.4.3. Multivariate feature imputation

6.4.4. Nearest neighbors imputation

6.4.5. Keeping the number of features constant

6.4.6. Marking imputed values

6.4.7. Estimators that handle NaN values

6.5. Unsupervised dimensionality reduction

6.5.1. PCA: principal component analysis

6.5.2. Random projections

6.5.3. Feature agglomeration

6.6. Random Projection

6.6.1. The Johnson-Lindenstrauss lemma

6.6.2. Gaussian random projection

6.6.3. Sparse random projection

6.6.4. Inverse Transform

6.7. Kernel Approximation

6.7.1. Nystroem Method for Kernel Approximation

6.7.2. Radial Basis Function Kernel

6.7.3. Additive Chi Squared Kernel

6.7.4. Skewed Chi Squared Kernel

6.7.5. Polynomial Kernel Approximation via Tensor Sketch

6.7.6. Mathematical Details

6.8. Pairwise metrics, Affinities and Kernels

6.8.1. Cosine similarity

6.8.2. Linear kernel

6.8.3. Polynomial kernel

6.8.4. Sigmoid kernel

6.8.5. RBF kernel

6.8.6. Laplacian kernel

6.8.7. Chi-squared kernel

6.9. Transforming the prediction target (y)

6.9.1. Label binarization

6.9.2. Label encoding

7. Dataset loading utilities

7.1. Toy datasets

7.1.1. Iris plants dataset

7.1.2. Diabetes dataset

7.1.3. Optical recognition of handwritten digits dataset

7.1.4. Linnerrud dataset

7.1.5. Wine recognition dataset

7.1.6. Breast cancer wisconsin (diagnostic) dataset

7.2. Real world datasets

7.2.1. The Olivetti faces dataset

7.2.2. The 20 newsgroups text dataset

7.2.3. The Labeled Faces in the Wild face recognition dataset

7.2.4. Forest covertypes

7.2.5. RCV1 dataset

7.2.6. Kddcup 99 dataset

7.2.7. California Housing dataset

7.2.8. Species distribution dataset

7.3. Generated datasets

7.3.1. Generators for classification and clustering

7.3.2. Generators for regression

7.3.3. Generators for manifold learning

7.3.4. Generators for decomposition

7.4. Loading other datasets

7.4.1. Sample images

7.4.2. Datasets in svmlight / libsvm format

7.4.3. Downloading datasets from the openml.org repository

7.4.4. Loading from external datasets

8. Computing with scikit-learn

8.1. Strategies to scale computationally: bigger data

8.1.1. Scaling with instances using out-of-core learning

8.2. Computational Performance

8.2.1. Prediction Latency

8.2.2. Prediction Throughput

8.2.3. Tips and Tricks

8.3. Parallelism, resource management, and configuration

8.3.1. Parallelism

8.3.2. Configuration switches

9. Model persistence

9.1. Python specific serialization

9.1.1. Security & maintainability limitations

9.1.2. A more secure format: skops

9.2. Interoperable formats

10. Common pitfalls and recommended practices

10.1. Inconsistent preprocessing

10.2. Data leakage

10.2.1. How to avoid data leakage

10.2.2. Data leakage during pre-processing

10.3. Controlling randomness

10.3.1. Using None or RandomState instances, and repeated calls to fit and split

10.3.2. Common pitfalls and subtleties

10.3.3. General recommendations

11. Dispatching

11.1. Array API support (experimental)

11.1.1. Example usage

11.1.2. Support for Array API-compatible inputs

11.1.3. Common estimator checks

Under Development

1. Metadata Routing

sklearn.svm.SVC — scikit-learn 1.4.1 documentation

sklearn.svm.SVC

class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

C-Support Vector Classification.

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer or other Kernel Approximation.
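As a rough sketch of the large-dataset alternative mentioned above, the pipeline below approximates an RBF kernel with a Nystroem transformer and trains a linear model with a hinge loss on the result. The synthetic dataset, the choice of n_components=100 and the variable names are assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Approximate RBF feature map followed by a linear SVM-like model
# (hinge loss) trained with stochastic gradient descent.
approx_svm = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
approx_svm.fit(X, y)

train_accuracy = approx_svm.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```

The fit cost of this pipeline grows roughly linearly with the number of samples for a fixed n_components, which is the motivation for preferring it over SVC on large datasets.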

The multiclass support is handled according to a one-vs-one scheme.

For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.

To learn how to tune SVC's hyperparameters, see the following example: Nested versus non-nested cross-validation
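For orientation, here is a minimal (non-nested) grid search over C and gamma. It is only a sketch: the iris dataset, the grid values and the 5-fold setting are assumptions chosen for brevity, and the linked example covers the nested variant.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid; the "svc__" prefix targets the SVC step of the pipeline.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```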

Read more in the User Guide.

Parameters:

C : float, default=1.0

Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'

Specifies the kernel type to be used in the algorithm. If none is given, 'rbf' will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples). For an intuitive visualization of different kernel types see Plot classification boundaries with different SVM Kernels.

degree : int, default=3

Degree of the polynomial kernel function ('poly'). Must be non-negative. Ignored by all other kernels.

gamma : {'scale', 'auto'} or float, default='scale'

Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.

if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,

if 'auto', uses 1 / n_features,

if float, must be non-negative.

Changed in version 0.22: The default value of gamma changed from 'auto' to 'scale'.

coef0 : float, default=0.0

Independent term in kernel function. It is only significant in 'poly' and 'sigmoid'.

shrinking : bool, default=True

Whether to use the shrinking heuristic. See the User Guide.

probability : bool, default=False

Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.

tol : float, default=1e-3

Tolerance for stopping criterion.

cache_size : float, default=200

Specify the size of the kernel cache (in MB).

class_weight : dict or 'balanced', default=None

Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

verbose : bool, default=False

Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.

max_iter : int, default=-1

Hard limit on iterations within solver, or -1 for no limit.

decision_function_shape : {'ovo', 'ovr'}, default='ovr'

Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, note that internally, one-vs-one ('ovo') is always used as a multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.

Changed in version 0.19: decision_function_shape is 'ovr' by default.

New in version 0.17: decision_function_shape='ovr' is recommended.

Changed in version 0.17: Deprecated decision_function_shape='ovo' and None.

break_ties : bool, default=False

If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.

New in version 0.22.

random_state : int, RandomState instance or None, default=None

Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
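Two of the parameters above are defined by formulas over the training data: gamma='scale' and class_weight='balanced'. The sketch below recomputes both by hand and compares them with what the estimator uses. The synthetic, imbalanced dataset and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced two-class data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

# gamma='scale' is documented as 1 / (n_features * X.var()).
manual_gamma = 1.0 / (X.shape[1] * X.var())
scale_scores = SVC(gamma="scale").fit(X, y).decision_function(X)
manual_scores = SVC(gamma=manual_gamma).fit(X, y).decision_function(X)
print("same decision values:", np.allclose(scale_scores, manual_scores))

# class_weight='balanced' is documented as
# n_samples / (n_classes * np.bincount(y)).
manual_weights = len(y) / (len(np.unique(y)) * np.bincount(y))
balanced_weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
fitted_weights = SVC(class_weight="balanced").fit(X, y).class_weight_
print("manual weights:", manual_weights)
print("fitted weights:", fitted_weights)
```

The minority class receives the larger weight, which increases the penalty C applied to its misclassified samples.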

Attributes:

class_weight_ : ndarray of shape (n_classes,)

Multipliers of parameter C for each class. Computed based on the class_weight parameter.

classes_ : ndarray of shape (n_classes,)

The class labels.

coef_ : ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)

Weights assigned to the features when kernel="linear".

dual_coef_ : ndarray of shape (n_classes - 1, n_SV)

Dual coefficients of the support vectors in the decision function (see Mathematical formulation), multiplied by their targets. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the multi-class section of the User Guide for details.

fit_status_ : int

0 if correctly fitted, 1 otherwise (will raise warning).

intercept_ : ndarray of shape (n_classes * (n_classes - 1) / 2,)

Constants in decision function.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

n_iter_ : ndarray of shape (n_classes * (n_classes - 1) // 2,)

Number of iterations run by the optimization routine to fit the model. The shape of this attribute depends on the number of models optimized which in turn depends on the number of classes.

New in version 1.1.

support_ : ndarray of shape (n_SV)

Indices of support vectors.

support_vectors_ : ndarray of shape (n_SV, n_features)

Support vectors. An empty array if kernel is precomputed.

n_support_ : ndarray of shape (n_classes,), dtype=int32

Number of support vectors for each class.

probA_ : ndarray of shape (n_classes * (n_classes - 1) / 2)

Parameter learned in Platt scaling when probability=True.

probB_ : ndarray of shape (n_classes * (n_classes - 1) / 2)

Parameter learned in Platt scaling when probability=True.

shape_fit_ : tuple of int of shape (n_dimensions_of_X,)

Array dimensions of training vector X.
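The attribute shapes listed above can be checked on a small fitted model. This is a sketch using the iris dataset (3 classes, 4 features) and a linear kernel so that coef_ is defined. The dataset and the helper variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
clf = SVC(kernel="linear").fit(X, y)

n_classes = len(clf.classes_)
n_sv = clf.support_vectors_.shape[0]
n_pairs = n_classes * (n_classes - 1) // 2  # number of one-vs-one models

print("support_        ", clf.support_.shape)          # (n_SV,)
print("support_vectors_", clf.support_vectors_.shape)  # (n_SV, n_features)
print("n_support_      ", clf.n_support_)              # per-class counts
print("dual_coef_      ", clf.dual_coef_.shape)        # (n_classes - 1, n_SV)
print("coef_           ", clf.coef_.shape)             # (n_pairs, n_features)
print("intercept_      ", clf.intercept_.shape)        # (n_pairs,)

# support_ indexes the rows of the training data that are support vectors.
same_rows = np.allclose(clf.support_vectors_, X[clf.support_])
```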

See also

SVR : Support Vector Machine for Regression implemented using libsvm.

LinearSVC : Scalable Linear Support Vector Machine for classification implemented using liblinear. Check the See Also section of LinearSVC for further comparison.

References

[1] LIBSVM: A Library for Support Vector Machines

[2] Platt, John (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods"

Examples

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
>>> clf.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])
>>> print(clf.predict([[-0.8, -1]]))
[1]

Methods

decision_function(X): Evaluate the decision function for the samples in X.

fit(X, y[, sample_weight]): Fit the SVM model according to the given training data.

get_metadata_routing(): Get metadata routing of this object.

get_params([deep]): Get parameters for this estimator.

predict(X): Perform classification on samples in X.

predict_log_proba(X): Compute log probabilities of possible outcomes for samples in X.

predict_proba(X): Compute probabilities of possible outcomes for samples in X.

score(X, y[, sample_weight]): Return the mean accuracy on the given test data and labels.

set_fit_request(*[, sample_weight]): Request metadata passed to the fit method.

set_params(**params): Set the parameters of this estimator.

set_score_request(*[, sample_weight]): Request metadata passed to the score method.

property coef_

Weights assigned to the features when kernel="linear".

Returns:

ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)

decision_function(X)

Evaluate the decision function for the samples in X.

Parameters:

X : array-like of shape (n_samples, n_features)

The input samples.

Returns:

X : ndarray of shape (n_samples, n_classes * (n_classes - 1) / 2)

Returns the decision function of the sample for each class in the model. If decision_function_shape='ovr', the shape is (n_samples, n_classes).
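To make the two return shapes concrete, the sketch below fits the same model with both values of decision_function_shape on a four-class problem, where the shapes differ (4 columns for 'ovr', 4 * 3 / 2 = 6 for 'ovo'). The make_blobs dataset is an assumption for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four well-separated clusters, one class per cluster (illustrative data).
X, y = make_blobs(n_samples=120, centers=4, random_state=0)

ovr_scores = SVC(decision_function_shape="ovr").fit(X, y).decision_function(X)
ovo_scores = SVC(decision_function_shape="ovo").fit(X, y).decision_function(X)

print("ovr:", ovr_scores.shape)  # (n_samples, n_classes)
print("ovo:", ovo_scores.shape)  # (n_samples, n_classes * (n_classes - 1) / 2)
```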

Notes

If decision_function_shape='ovo', the function values are proportional to the distance of the samples X to the separating hyperplane. If the exact distances are required, divide the function values by the norm of the weight vector (coef_). See also this question for further details. If decision_function_shape='ovr', the decision function is a monotonic transformation of ovo decision function.
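For a binary problem with kernel="linear", the note above can be sketched as follows: the decision values are reproduced from coef_ and intercept_, then divided by the norm of coef_ to obtain signed distances to the hyperplane. The dataset and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two-class toy data (illustrative only).
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Decision values as returned by the estimator, shape (n_samples,).
scores = clf.decision_function(X)

# The same values rebuilt from the linear weights: w . x + b.
manual_scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# Signed geometric distance to the separating hyperplane.
distances = scores / np.linalg.norm(clf.coef_)
print(distances[:5])
```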

fit(X, y, sample_weight=None)

Fit the SVM model according to the given training data.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training vectors, where n_samples is the number of samples and n_features is the number of features. For kernel="precomputed", the expected shape of X is (n_samples, n_samples).

y : array-like of shape (n_samples,)

Target values (class labels in classification, real numbers in regression).

sample_weight : array-like of shape (n_samples,), default=None

Per-sample weights. Rescale C per sample. Higher weights force the classifier to put more emphasis on these points.

Returns:

self : object

Fitted estimator.

Notes

If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied.

If X is a dense array, then the other methods will not support sparse matrices as input.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

property n_support_

Number of support vectors for each class.

predict(X)

Perform classification on samples in X.

For a one-class model, +1 or -1 is returned.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples_test, n_samples_train)

For kernel="precomputed", the expected shape of X is (n_samples_test, n_samples_train).

Returns:

y_pred : ndarray of shape (n_samples,)

Class labels for samples in X.
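The precomputed-kernel shapes used by fit and predict can be illustrated with a linear Gram matrix. In this sketch the training matrix is (n_train, n_train) and the test matrix is (n_test, n_train). The dataset, the split and the comparison against kernel="linear" are assumptions for illustration, and the two models should normally agree on almost all predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel computed by hand: K[i, j] = <x_i, x_j>.
gram_train = X_train @ X_train.T  # shape (n_train, n_train), passed to fit
gram_test = X_test @ X_train.T    # shape (n_test, n_train), passed to predict

precomputed = SVC(kernel="precomputed").fit(gram_train, y_train)
pred_precomputed = precomputed.predict(gram_test)

# Reference model that computes the same kernel internally.
pred_linear = SVC(kernel="linear").fit(X_train, y_train).predict(X_test)

agreement = np.mean(pred_precomputed == pred_linear)
print(f"agreement with kernel='linear': {agreement:.2f}")
```

Note that the columns of the test matrix must follow the order of the training samples used in fit.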

predict_log_proba(X)

Compute log probabilities of possible outcomes for samples in X.

The model needs to have probability information computed at training time: fit with the parameter probability set to True.

Parameters:

X : array-like of shape (n_samples, n_features) or (n_samples_test, n_samples_train)

For kernel="precomputed", the expected shape of X is (n_samples_test, n_samples_train).

Returns:

T : ndarray of shape (n_samples, n_classes)

Returns the log-probabilities of the sample for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.

Notes

The probability model is created using cross validation, so the results can be slightly different from those obtained by predict. Also, it will produce meaningless results on very small datasets.

predict_proba(X)

Compute probabilities of possible outcomes for samples in X.

The model needs to have probability information computed at training time: fit with the parameter probability set to True.

Parameters:

X : array-like of shape (n_samples, n_features)

For kernel="precomputed", the expected shape of X is (n_samples_test, n_samples_train).

Returns:

T : ndarray of shape (n_samples, n_classes)

Returns the probability of the sample for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.

Notes

The probability model is created using cross validation, so the results can be slightly different from those obtained by predict. Also, it will produce meaningless results on very small datasets.
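A short sketch of the probability methods follows, assuming the iris dataset and a fixed random_state. It also measures how often the most probable class matches predict, since the notes above say the two can disagree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True must be set before fit; it adds an internal
# cross-validation step, so fitting is slower.
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X)          # shape (n_samples, n_classes)
log_proba = clf.predict_log_proba(X)  # element-wise log of proba

# Columns follow clf.classes_, so argmax maps back to a class label.
proba_labels = clf.classes_[proba.argmax(axis=1)]
agreement = np.mean(proba_labels == clf.predict(X))
print(f"argmax(predict_proba) matches predict on {agreement:.0%} of samples")
```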

property probA_

Parameter learned in Platt scaling when probability=True.

Returns:

ndarray of shape (n_classes * (n_classes - 1) / 2)

property probB_

Parameter learned in Platt scaling when probability=True.

Returns:

ndarray of shape (n_classes * (n_classes - 1) / 2)

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:

score : float

Mean accuracy of self.predict(X) w.r.t. y.
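As a small check of the definition above, score should match accuracy_score applied to the predictions. The iris dataset and the train/test split below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC().fit(X_train, y_train)

# score() is the mean accuracy of predict(X_test) with respect to y_test.
score = clf.score(X_test, y_test)
manual = accuracy_score(y_test, clf.predict(X_test))
print(score, manual)
```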

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVC

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to fit.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns:

self : object

The updated object.

set_params(**params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects

(such as Pipeline). The latter have

parameters of the form __ so that it’s

possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.
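A short sketch of both uses of set_params, on a simple estimator and on a nested object. The step names "scaler" and "svc" are arbitrary choices made for this illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step names are arbitrary labels chosen for this sketch.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Update a parameter of a simple estimator directly.
svc = SVC().set_params(C=10.0)

# Update parameters of a nested step using the <component>__<parameter> form.
pipe.set_params(svc__C=10.0, svc__kernel="linear")

print(svc.C, pipe.named_steps["svc"].C, pipe.named_steps["svc"].kernel)
```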

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SVC[source]¶

Request metadata passed to the score method.

Note that this method is only relevant if

enable_metadata_routing=True (see sklearn.set_config).

Please see User Guide on how the routing

mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to score.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the

existing request. This allows you to change the request for some

parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a

sub-estimator of a meta-estimator, e.g. used inside a

Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns:

self : object

The updated object.

Examples using sklearn.svm.SVC¶

Release Highlights for scikit-learn 0.24

Release Highlights for scikit-learn 0.22

Classifier comparison

Plot classification probability

Recognizing hand-written digits

Plot the decision boundaries of a VotingClassifier

Faces recognition example using eigenfaces and SVMs

Libsvm GUI

Recursive feature elimination

Scalable learning with polynomial kernel approximation

Displaying Pipelines

Explicit feature map approximation for RBF kernels

Multilabel classification

ROC Curve with Visualization API

Comparison between grid search and successive halving

Confusion matrix

Custom refit strategy of a grid search with cross-validation

Nested versus non-nested cross-validation

Plotting Learning Curves and Checking Models’ Scalability

Plotting Validation Curves

Receiver Operating Characteristic (ROC) with cross validation

Statistical comparison of models using grid search

Test with permutations the significance of a classification score

Concatenating multiple feature extraction methods

Feature discretization

Decision boundary of semi-supervised classifiers versus SVM on the Iris dataset

Effect of varying threshold for self-training

Plot classification boundaries with different SVM Kernels

Plot different SVM classifiers in the iris dataset

RBF SVM parameters

SVM Margins Example

SVM Tie Breaking Example

SVM with custom kernel

SVM-Anova: SVM with univariate feature selection

SVM: Maximum margin separating hyperplane

SVM: Separating hyperplane for unbalanced classes

SVM: Weighted samples

SVM Exercise


sklearn.decomposition.PCA¶

class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)[source]¶

Principal component analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the

data to project it to a lower dimensional space. The input data is centered

but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated

SVD by the method of Halko et al. 2009, depending on the shape of the input

data and the number of components to extract.

It can also use the scipy.sparse.linalg ARPACK implementation of the

truncated SVD.

Notice that this class does not support sparse input. See

TruncatedSVD for an alternative with sparse data.

For a usage example, see

PCA example with Iris Data-set

Read more in the User Guide.

Parameters:

n_components : int, float or ‘mle’, default=None

Number of components to keep.

If n_components is not set, all components are kept:

n_components == min(n_samples, n_features)

If n_components == 'mle' and svd_solver == 'full', Minka’s

MLE is used to guess the dimension. Use of n_components == 'mle'

will interpret svd_solver == 'auto' as svd_solver == 'full'.

If 0 < n_components < 1 and svd_solver == 'full', select the

number of components such that the amount of variance that needs to be

explained is greater than the percentage specified by n_components.

If svd_solver == 'arpack', the number of components must be

strictly less than the minimum of n_features and n_samples.

Hence, the None case results in:

n_components == min(n_samples, n_features) - 1

copy : bool, default=True

If False, data passed to fit are overwritten and running

fit(X).transform(X) will not yield the expected results,

use fit_transform(X) instead.

whiten : bool, default=False

When True (False by default) the components_ vectors are multiplied

by the square root of n_samples and then divided by the singular values

to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal

(the relative variance scales of the components) but can sometimes

improve the predictive accuracy of the downstream estimators by

making their data respect some hard-wired assumptions.

svd_solver : {‘auto’, ‘full’, ‘arpack’, ‘randomized’}, default=‘auto’

If auto: The solver is selected by a default policy based on X.shape and

n_components: if the input data is larger than 500x500 and the

number of components to extract is lower than 80% of the smallest

dimension of the data, then the more efficient ‘randomized’

method is enabled. Otherwise the exact full SVD is computed and

optionally truncated afterwards.

If full: run exact full SVD calling the standard LAPACK solver via

scipy.linalg.svd and select the components by postprocessing.

If arpack: run SVD truncated to n_components calling ARPACK solver via

scipy.sparse.linalg.svds. It requires strictly

0 < n_components < min(X.shape)

If randomized: run randomized SVD by the method of Halko et al.

New in version 0.18.0.

tol : float, default=0.0

Tolerance for singular values computed by svd_solver == ‘arpack’.

Must be in the range [0.0, infinity).

New in version 0.18.0.

iterated_power : int or ‘auto’, default=‘auto’

Number of iterations for the power method computed by

svd_solver == ‘randomized’.

Must be in the range [0, infinity).

New in version 0.18.0.

n_oversamples : int, default=10

This parameter is only relevant when svd_solver="randomized".

It corresponds to the additional number of random vectors to sample the

range of X so as to ensure proper conditioning. See

randomized_svd for more details.

New in version 1.1.

power_iteration_normalizer : {‘auto’, ‘QR’, ‘LU’, ‘none’}, default=‘auto’

Power iteration normalizer for randomized SVD solver.

Not used by ARPACK. See randomized_svd

for more details.

New in version 1.1.

random_state : int, RandomState instance or None, default=None

Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int

for reproducible results across multiple function calls.

See Glossary.

New in version 0.18.0.
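As an illustration of the fractional n_components option described above, the sketch below uses synthetic data (an assumption made for this example) and lets PCA choose how many components are needed to explain more than 95% of the variance. The number actually selected is exposed as n_components_.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data: 200 samples, 5 features with very different scales.
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# A float in (0, 1) requires svd_solver="full" (or "auto", per the text above).
pca = PCA(n_components=0.95, svd_solver="full").fit(X)

print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```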

Attributes:

components_ : ndarray of shape (n_components, n_features)

Principal axes in feature space, representing the directions of

maximum variance in the data. Equivalently, the right singular

vectors of the centered input data, parallel to its eigenvectors.

The components are sorted by decreasing explained_variance_.

explained_variance_ : ndarray of shape (n_components,)

The amount of variance explained by each of the selected components.

The variance estimation uses n_samples - 1 degrees of freedom.

Equal to n_components largest eigenvalues

of the covariance matrix of X.

New in version 0.18.

explained_variance_ratio_ : ndarray of shape (n_components,)

Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the

sum of the ratios is equal to 1.0.

singular_values_ : ndarray of shape (n_components,)

The singular values corresponding to each of the selected components.

The singular values are equal to the 2-norms of the n_components

variables in the lower-dimensional space.

New in version 0.19.

mean_ : ndarray of shape (n_features,)

Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).

n_components_ : int

The estimated number of components. When n_components is set

to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this

number is estimated from input data. Otherwise it equals the parameter

n_components, or the lesser value of n_features and n_samples

if n_components is None.

n_samples_ : int

Number of samples in the training data.

noise_variance_ : float

The estimated noise covariance following the Probabilistic PCA model

from Tipping and Bishop 1999. See “Pattern Recognition and

Machine Learning” by C. Bishop, 12.2.1 p. 574 or

http://www.miketipping.com/papers/met-mppca.pdf. It is required to

compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components)

smallest eigenvalues of the covariance matrix of X.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X

has feature names that are all strings.

New in version 1.0.
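The relationships stated in the attribute descriptions above can be checked numerically. The sketch below reuses the small array from the Examples section; the helper variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
pca = PCA(n_components=2).fit(X)

# mean_ is the per-feature mean of the training data.
mean_matches = np.allclose(pca.mean_, X.mean(axis=0))

# explained_variance_ holds the largest eigenvalues of the covariance matrix
# (np.cov uses n_samples - 1 degrees of freedom by default).
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
variance_matches = np.allclose(pca.explained_variance_, eigenvalues)

# singular_values_ are the 2-norms of the variables in the projected space.
projected = pca.transform(X)
norms_match = np.allclose(pca.singular_values_, np.linalg.norm(projected, axis=0))

# With all components kept, the explained variance ratios sum to 1.
ratio_sum = pca.explained_variance_ratio_.sum()
print(mean_matches, variance_matches, norms_match, ratio_sum)
```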

See also

KernelPCA : Kernel Principal Component Analysis.

SparsePCA : Sparse Principal Component Analysis.

TruncatedSVD : Dimensionality reduction using truncated SVD.

IncrementalPCA : Incremental Principal Component Analysis.

References

For n_components == ‘mle’, this class uses the method from:

Minka, T. P. “Automatic choice of dimensionality for PCA”.

In NIPS, pp. 598-604

Implements the probabilistic PCA model from:

Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic principal

component analysis”. Journal of the Royal Statistical Society:

Series B (Statistical Methodology), 61(3), 611-622.

via the score and score_samples methods.

For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.

For svd_solver == ‘randomized’, see:

Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).

“Finding structure with randomness: Probabilistic algorithms for

constructing approximate matrix decompositions”.

SIAM review, 53(2), 217-288.

and also

Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).

“A randomized algorithm for the decomposition of matrices”.

Applied and Computational Harmonic Analysis, 30(1), 47-68.

Examples

>>> import numpy as np

>>> from sklearn.decomposition import PCA

>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

>>> pca = PCA(n_components=2)

>>> pca.fit(X)

PCA(n_components=2)

>>> print(pca.explained_variance_ratio_)

[0.9924... 0.0075...]

>>> print(pca.singular_values_)

[6.30061... 0.54980...]

>>> pca = PCA(n_components=2, svd_solver='full')

>>> pca.fit(X)

PCA(n_components=2, svd_solver='full')

>>> print(pca.explained_variance_ratio_)

[0.9924... 0.00755...]

>>> print(pca.singular_values_)

[6.30061... 0.54980...]

>>> pca = PCA(n_components=1, svd_solver='arpack')

>>> pca.fit(X)

PCA(n_components=1, svd_solver='arpack')

>>> print(pca.explained_variance_ratio_)

[0.99244...]

>>> print(pca.singular_values_)

[6.30061...]

Methods

fit(X[, y])

Fit the model with X.

fit_transform(X[, y])

Fit the model with X and apply the dimensionality reduction on X.

get_covariance()

Compute data covariance with the generative model.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

get_precision()

Compute data precision matrix with the generative model.

inverse_transform(X)

Transform data back to its original space.

score(X[, y])

Return the average log-likelihood of all samples.

score_samples(X)

Return the log-likelihood of each sample.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Apply dimensionality reduction to X.

fit(X, y=None)[source]¶

Fit the model with X.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training data, where n_samples is the number of samples

and n_features is the number of features.

y : Ignored

Ignored.

Returns:

self : object

Returns the instance itself.

fit_transform(X, y=None)[source]¶

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training data, where n_samples is the number of samples

and n_features is the number of features.

y : Ignored

Ignored.

Returns:

X_new : ndarray of shape (n_samples, n_components)

Transformed values.

Notes

This method returns a Fortran-ordered array. To convert it to a

C-ordered array, use ‘np.ascontiguousarray’.
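A minimal sketch of the conversion mentioned in the note, using the toy array from the Examples section. The memory layout of the raw result is printed rather than assumed, since it may depend on the solver and version.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

X_new = PCA(n_components=2).fit_transform(X)

# Request a C-ordered (row-major) array; values are unchanged.
X_c = np.ascontiguousarray(X_new)

print(X_new.flags["C_CONTIGUOUS"], X_c.flags["C_CONTIGUOUS"])
```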

get_covariance()[source]¶

Compute data covariance with the generative model.

cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)

where S**2 contains the explained variances, and sigma2 contains the

noise variances.

Returns:

cov : array of shape=(n_features, n_features)

Estimated covariance of data.
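As a sanity check of the generative covariance, the sketch below considers the special case where all components are kept, so no variance is left for the noise term and the model covariance should coincide with the empirical covariance. The synthetic data and the mixing matrix are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Correlated synthetic data: 100 samples, 3 features.
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mixing

# n_components=None keeps min(n_samples, n_features) = 3 components.
pca = PCA().fit(X)

cov_model = pca.get_covariance()
cov_empirical = np.cov(X, rowvar=False)  # uses n_samples - 1, like PCA

print(np.allclose(cov_model, cov_empirical))
```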

get_feature_names_out(input_features=None)[source]¶

Get output feature names for transformation.

The feature names out will be prefixed by the lowercased class name. For

example, if the transformer outputs 3 features, then the feature names

out are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features : array-like of str or None, default=None

Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out : ndarray of str objects

Transformed feature names.

get_metadata_routing()[source]¶

Get metadata routing of this object.

Please check User Guide on how the routing

mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating

routing information.

get_params(deep=True)[source]¶

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and

contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

get_precision()[source]¶

Compute data precision matrix with the generative model.

Equals the inverse of the covariance but computed with

the matrix inversion lemma for efficiency.

Returns:

precision : array, shape=(n_features, n_features)

Estimated precision of data.
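A small check, on synthetic data assumed for this example, that the precision matrix behaves as the inverse of the model covariance when fewer components than features are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))

# Keep 2 of 4 components so that the noise variance is non-zero.
pca = PCA(n_components=2).fit(X)

# precision @ covariance should be (numerically) the identity matrix.
product = pca.get_precision() @ pca.get_covariance()
is_identity = np.allclose(product, np.eye(4))
print(is_identity)
```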

inverse_transform(X)[source]¶

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:

X : array-like of shape (n_samples, n_components)

New data, where n_samples is the number of samples

and n_components is the number of components.

Returns:

X_original : array-like of shape (n_samples, n_features)

Original data, where n_samples is the number of samples

and n_features is the number of features.

Notes

If whitening is enabled, inverse_transform will compute the

exact inverse operation, which includes reversing whitening.
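The sketch below illustrates the note on whitening with the toy array from the Examples section. With whiten=True the transformed training data has unit variance per component, and because all components are kept here, inverse_transform recovers the original data.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

pca = PCA(n_components=2, whiten=True).fit(X)
X_white = pca.transform(X)

# Whitened components have unit variance (n_samples - 1 degrees of freedom).
unit_variance = np.allclose(X_white.var(axis=0, ddof=1), 1.0)

# inverse_transform also reverses the whitening step.
X_back = pca.inverse_transform(X_white)
round_trip = np.allclose(X_back, X)

print(unit_variance, round_trip)
```

When fewer components are kept, the round trip is only an approximation of the original data.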

score(X, y=None)[source]¶

Return the average log-likelihood of all samples.

See “Pattern Recognition and Machine Learning”

by C. Bishop, 12.2.1 p. 574

or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:

X : array-like of shape (n_samples, n_features)

The data.

y : Ignored

Ignored.

Returns:

ll : float

Average log-likelihood of the samples under the current model.

score_samples(X)[source]¶

Return the log-likelihood of each sample.

See “Pattern Recognition and Machine Learning”

by C. Bishop, 12.2.1 p. 574

or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:

X : array-like of shape (n_samples, n_features)

The data.

Returns:

ll : ndarray of shape (n_samples,)

Log-likelihood of each sample under the current model.
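A brief sketch relating score and score_samples, using synthetic data assumed for this example: the average log-likelihood returned by score is the mean of the per-sample values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=2).fit(X)

per_sample = pca.score_samples(X)  # one log-likelihood per row of X
average = pca.score(X)             # a single float

print(per_sample.shape, average)
```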

set_output(*, transform=None)[source]¶

Set output container.

See Introducing the set_output API

for an example on how to use the API.

Parameters:

transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer

"pandas": DataFrame output

"polars": Polars output

None: Transform configuration is unchanged

New in version 1.4: "polars" option was added.

Returns:

self : estimator instance

Estimator instance.

set_params(**params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects

(such as Pipeline). The latter have

parameters of the form <component>__<parameter> so that it’s

possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.

transform(X)[source]¶

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted

from a training set.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data, where n_samples is the number of samples

and n_features is the number of features.

Returns:

X_new : array-like of shape (n_samples, n_components)

Projection of X in the first principal components, where n_samples

is the number of samples and n_components is the number of the components.
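As a rough illustration of the projection, and assuming whiten=False (the default), transform amounts to centering X with mean_ and projecting onto the rows of components_. The toy array is the one from the Examples section.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

pca = PCA(n_components=1).fit(X)

# Center with the training mean, then project onto the principal axes.
manual = (X - pca.mean_) @ pca.components_.T
same = np.allclose(pca.transform(X), manual)

print(manual.shape, same)
```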

Examples using sklearn.decomposition.PCA¶

Release Highlights for scikit-learn 1.4

A demo of K-Means clustering on the handwritten digits data

Principal Component Regression vs Partial Least Squares Regression

The Iris Dataset

Blind source separation using FastICA

Comparison of LDA and PCA 2D projection of Iris dataset

Faces dataset decompositions

Factor Analysis (with rotation) to visualize patterns

FastICA on 2D point clouds

Incremental PCA

Kernel PCA

Model selection with Probabilistic PCA and Factor Analysis (FA)

PCA example with Iris Data-set

Faces recognition example using eigenfaces and SVMs

Image denoising using kernel PCA

Multi-dimensional scaling

Displaying Pipelines

Explicit feature map approximation for RBF kernels

Multilabel classification

Balance model complexity and cross-validated score

Dimensionality Reduction with Neighborhood Components Analysis

Kernel Density Estimation

Column Transformer with Heterogeneous Data Sources

Concatenating multiple feature extraction methods

Pipelining: chaining a PCA and a logistic regression

Selecting dimensionality reduction with Pipeline and GridSearchCV

Importance of Feature Scaling



sklearn.cluster.KMeans¶

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')[source]¶

K-Means clustering.

Read more in the User Guide.

Parameters:

n_clusters : int, default=8

The number of clusters to form as well as the number of

centroids to generate.

For an example of how to choose an optimal value for n_clusters refer to

Selecting the number of clusters with silhouette analysis on KMeans clustering.

init : {‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=‘k-means++’

Method for initialization:

‘k-means++’ : selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

For an example of how to use the different init strategies, see the example

entitled A demo of K-Means clustering on the handwritten digits data.

n_init : ‘auto’ or int, default=‘auto’

Number of times the k-means algorithm is run with different centroid

seeds. The final result is the best output of n_init consecutive runs

in terms of inertia. Several runs are recommended for sparse

high-dimensional problems (see Clustering sparse data with k-means).

When n_init='auto', the number of runs depends on the value of init:

10 if using init='random' or init is a callable;

1 if using init='k-means++' or init is an array-like.

New in version 1.2: Added ‘auto’ option for n_init.

Changed in version 1.4: Default value for n_init changed to 'auto'.

max_iter : int, default=300

Maximum number of iterations of the k-means algorithm for a

single run.

tol : float, default=1e-4

Relative tolerance with regards to Frobenius norm of the difference

in the cluster centers of two consecutive iterations to declare

convergence.

verbose : int, default=0

Verbosity mode.

random_state : int, RandomState instance or None, default=None

Determines random number generation for centroid initialization. Use

an int to make the randomness deterministic.

See Glossary.

copy_x : bool, default=True

When pre-computing distances it is more numerically accurate to center

the data first. If copy_x is True (default), then the original data is

not modified. If False, the original data is modified, and put back

before the function returns, but small numerical differences may be

introduced by subtracting and then adding the data mean. Note that if

the original data is not C-contiguous, a copy will be made even if

copy_x is False. If the original data is sparse, but not in CSR format,

a copy will be made even if copy_x is False.

algorithm : {“lloyd”, “elkan”}, default=“lloyd”

K-means algorithm to use. The classical EM-style algorithm is "lloyd".

The "elkan" variation can be more efficient on some datasets with

well-defined clusters, by using the triangle inequality. However it’s

more memory intensive due to the allocation of an extra array of shape

(n_samples, n_clusters).

Changed in version 0.18: Added Elkan algorithm

Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”.

Changed “auto” to use “lloyd” instead of “elkan”.
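To illustrate the array form of init described above, the sketch below passes explicit starting centers for the toy data used in the Examples section. The chosen centers are an assumption made for this illustration; with an array init a single run (n_init=1) is sufficient.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Explicit initial centers, shape (n_clusters, n_features).
init_centers = np.array([[1.0, 2.0], [10.0, 2.0]])

kmeans = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)

print(kmeans.labels_)
print(kmeans.cluster_centers_)
```

Because the starting centers fix the ordering, cluster 0 is the group around x=1 and cluster 1 the group around x=10.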

Attributes:

cluster_centers_ : ndarray of shape (n_clusters, n_features)

Coordinates of cluster centers. If the algorithm stops before fully

converging (see tol and max_iter), these will not be

consistent with labels_.

labels_ : ndarray of shape (n_samples,)

Labels of each point.

inertia_ : float

Sum of squared distances of samples to their closest cluster center,

weighted by the sample weights if provided.

n_iter_ : int

Number of iterations run.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X

has feature names that are all strings.

New in version 1.0.

See also

MiniBatchKMeans : Alternative online implementation that does incremental updates of the centers positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where k is the number of clusters, n is the number of

samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with

n = n_samples, p = n_features.

Refer to “How slow is the k-means method?” D. Arthur and S. Vassilvitskii,

SoCG 2006, for more details.

In practice, the k-means algorithm is very fast (one of the fastest

clustering algorithms available), but it can fall into local minima. That’s why

it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or

max_iter), labels_ and cluster_centers_ will not be consistent,

i.e. the cluster_centers_ will not be the means of the points in each

cluster. Also, the estimator will reassign labels_ after the last

iteration to make labels_ consistent with predict on the training

set.

Examples

>>> from sklearn.cluster import KMeans

>>> import numpy as np

>>> X = np.array([[1, 2], [1, 4], [1, 0],

... [10, 2], [10, 4], [10, 0]])

>>> kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)

>>> kmeans.labels_

array([1, 1, 1, 0, 0, 0], dtype=int32)

>>> kmeans.predict([[0, 0], [12, 3]])

array([1, 0], dtype=int32)

>>> kmeans.cluster_centers_

array([[10.,  2.],

       [ 1.,  2.]])

For a more detailed example of K-Means using the iris dataset see

K-means Clustering.

For examples of common problems with K-Means and how to address them see

Demonstration of k-means assumptions.

For an example of how to use K-Means to perform color quantization see

Color Quantization using K-Means.

For a demonstration of how K-Means can be used to cluster text documents see

Clustering text documents using k-means.

For a comparison between K-Means and MiniBatchKMeans refer to example

Comparison of the K-Means and MiniBatchKMeans clustering algorithms.

Methods

fit(X[, y, sample_weight])

Compute k-means clustering.

fit_predict(X[, y, sample_weight])

Compute cluster centers and predict cluster index for each sample.

fit_transform(X[, y, sample_weight])

Compute clustering and transform X to cluster-distance space.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X[, sample_weight])

Predict the closest cluster each sample in X belongs to.

score(X[, y, sample_weight])

Opposite of the value of X on the K-means objective.

set_fit_request(*[, sample_weight])

Request metadata passed to the fit method.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, sample_weight])

Request metadata passed to the predict method.

set_score_request(*[, sample_weight])

Request metadata passed to the score method.

transform(X)

Transform X to a cluster-distance space.

fit(X, y=None, sample_weight=None)[source]¶

Compute k-means clustering.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. It must be noted that the data

will be converted to C ordering, which will cause a memory

copy if the given data is not C-contiguous.

If a sparse matrix is passed, a copy will be made if it’s not in

CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations

are assigned equal weight. sample_weight is not used during

initialization if init is a callable or a user provided array.

New in version 0.20.

Returns:

self : object

Fitted estimator.

fit_predict(X, y=None, sample_weight=None)[source]¶

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by

predict(X).

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations

are assigned equal weight.

Returns:

labels : ndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

fit_transform(X, y=None, sample_weight=None)[source]¶

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations

are assigned equal weight.

Returns:

X_new : ndarray of shape (n_samples, n_clusters)

X transformed in the new space.

get_feature_names_out(input_features=None)[source]¶

Get output feature names for transformation.

The feature names out will be prefixed by the lowercased class name. For

example, if the transformer outputs 3 features, then the feature names

out are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features : array-like of str or None, default=None

Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out : ndarray of str objects

Transformed feature names.

get_metadata_routing()[source]¶

Get metadata routing of this object.

Please check User Guide on how the routing

mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating

routing information.

get_params(deep=True)[source]¶

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and

contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

predict(X, sample_weight='deprecated')[source]¶

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called

the code book and each value returned by predict is the index of

the closest code in the code book.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations

are assigned equal weight.

Deprecated since version 1.3: The parameter sample_weight is deprecated in version 1.3

and will be removed in 1.5.

Returns:

labels : ndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

score(X, y=None, sample_weight=None)

Opposite of the value of X on the K-means objective.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

score : float

Opposite of the value of X on the K-means objective.
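A short sketch of what "opposite of the objective" means in practice: on the training data, score should equal the negated inertia_ up to floating-point error. The toy data below is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (illustrative only).
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# score returns the negated K-means objective, so higher (closer to 0) is better.
score = kmeans.score(X)
print(score, -kmeans.inertia_)
```

Negating the objective keeps the "greater is better" convention that model selection tools rely on.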

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans

Request metadata passed to the fit method.

Note that this method is only relevant if

enable_metadata_routing=True (see sklearn.set_config).

Please see User Guide on how the routing

mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to fit.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the

existing request. This allows you to change the request for some

parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a

sub-estimator of a meta-estimator, e.g. used inside a

Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns:

self : object

The updated object.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer

"pandas": DataFrame output

"polars": Polars output

None: Transform configuration is unchanged

New in version 1.4: "polars" option was added.

Returns:

self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects

(such as Pipeline). The latter have

parameters of the form <component>__<parameter> so that it’s

possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.
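A brief sketch of the nested-parameter form, using a hypothetical two-step pipeline whose step names ("scaler", "kmeans") are chosen here purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("kmeans", KMeans(n_clusters=8))])

# <component>__<parameter>: update n_clusters on the step named "kmeans".
pipe.set_params(kmeans__n_clusters=3)

# A plain estimator takes the parameter name directly.
km = KMeans().set_params(n_clusters=4)

print(pipe.named_steps["kmeans"].n_clusters, km.n_clusters)
```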

set_predict_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans

Request metadata passed to the predict method.

Note that this method is only relevant if

enable_metadata_routing=True (see sklearn.set_config).

Please see User Guide on how the routing

mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to predict.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the

existing request. This allows you to change the request for some

parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a

sub-estimator of a meta-estimator, e.g. used inside a

Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in predict.

Returns:

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KMeans

Request metadata passed to the score method.

Note that this method is only relevant if

enable_metadata_routing=True (see sklearn.set_config).

Please see User Guide on how the routing

mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to score.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the

existing request. This allows you to change the request for some

parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a

sub-estimator of a meta-estimator, e.g. used inside a

Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns:

self : object

The updated object.

transform(X)

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster

centers. Note that even if X is sparse, the array returned by

transform will typically be dense.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

Returns:

X_new : ndarray of shape (n_samples, n_clusters)

X transformed in the new space.
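To make the cluster-distance space concrete, this sketch (toy data assumed for illustration) compares transform with Euclidean distances to each centroid computed directly with NumPy.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (illustrative only).
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Column j of the output holds the distance of each sample to centroid j.
X_new = kmeans.transform(X)

# Same quantity computed by hand.
manual = np.linalg.norm(
    X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
print(X_new.shape)
```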

Examples using sklearn.cluster.KMeans

Release Highlights for scikit-learn 1.1

Release Highlights for scikit-learn 0.23

A demo of K-Means clustering on the handwritten digits data

Bisecting K-Means and Regular K-Means Performance Comparison

Color Quantization using K-Means

Comparison of the K-Means and MiniBatchKMeans clustering algorithms

Demonstration of k-means assumptions

Empirical evaluation of the impact of k-means initialization

K-means Clustering

Selecting the number of clusters with silhouette analysis on KMeans clustering

Clustering text documents using k-means


sklearn.model_selection.GridSearchCV — scikit-learn 1.4.1 documentation


sklearn.model_selection.GridSearchCV

class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a “fit” and a “score” method.

It also implements “score_samples”, “predict”, “predict_proba”,

“decision_function”, “transform” and “inverse_transform” if they are

implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized

by cross-validated grid-search over a parameter grid.

Read more in the User Guide.

Parameters:

estimator : estimator object

This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

param_grid : dict or list of dictionaries

Dictionary with parameter names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
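As a sketch of how a list of dictionaries is expanded, ParameterGrid can enumerate the candidates that a search would try. The grid values below are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Two sub-grids: gamma is only explored together with the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 0.2]},
]

# 2 linear candidates + 2 * 2 rbf candidates.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)
```

Splitting the grid this way avoids fitting combinations that make no sense, such as a gamma value with a linear kernel.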

scoring : str, callable, list, tuple or dict, default=None

Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use:

a single string (see The scoring parameter: defining model evaluation rules);

a callable (see Defining your scoring strategy from metric functions) that returns a single value.

If scoring represents multiple scores, one can use:

a list or tuple of unique strings;

a callable returning a dictionary where the keys are the metric

names and the values are the metric scores;

a dictionary with metric names as keys and callables as values.

See Specifying multiple metrics for evaluation for an example.
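A minimal multi-metric sketch follows. The dataset, the estimator and the grid are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two metrics are evaluated; refit names the one used to pick the best candidate.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [1, 10]},
    scoring=["accuracy", "f1_macro"],
    refit="accuracy",
)
search.fit(X, y)

# Result keys end with the scorer name instead of "_score".
metric_keys = sorted(k for k in search.cv_results_ if k.startswith("mean_test_"))
print(metric_keys)
```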

n_jobs : int, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Changed in version 0.20: n_jobs default changed from 1 to None

refit : bool, str, or callable, default=True

Refit an estimator using the best found parameters on the whole dataset.

For multiple metric evaluation, this needs to be a str denoting the

scorer that would be used to find the best parameters for refitting

the estimator at the end.

Where there are considerations other than maximum score in

choosing a best estimator, refit can be set to a function which

returns the selected best_index_ given cv_results_. In that

case, the best_estimator_ and best_params_ will be set

according to the returned best_index_ while the best_score_

attribute will not be available.

The refitted estimator is made available at the best_estimator_

attribute and permits using predict directly on this

GridSearchCV instance.

Also for multiple metric evaluation, the attributes best_index_,

best_score_ and best_params_ will only be available if

refit is set and all of them will be determined w.r.t this specific

scorer.

See scoring parameter to know more about multiple metric

evaluation.

See Custom refit strategy of a grid search with cross-validation

to see how to design a custom selection strategy using a callable

via refit.

Changed in version 0.20: Support for callable added.
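The sketch below shows one possible callable for refit. The selection rule (the fastest candidate within 0.02 of the best mean test score) and the function name are hypothetical choices made for illustration, not a recommended strategy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def pick_fast_good_candidate(cv_results):
    """Return the index of the fastest candidate close to the best score."""
    mean_score = np.asarray(cv_results["mean_test_score"])
    mean_fit_time = np.asarray(cv_results["mean_fit_time"])
    # Candidates within a (hypothetical) tolerance of the best mean score.
    good = np.flatnonzero(mean_score >= mean_score.max() - 0.02)
    return int(good[np.argmin(mean_fit_time[good])])


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]},
    refit=pick_fast_good_candidate,
)
search.fit(X, y)

print(search.best_index_, search.best_params_)
```

Because the callable decides the winner, best_score_ is not set; the score of the chosen candidate can still be read from cv_results_ at best_index_.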

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy.

Possible inputs for cv are:

None, to use the default 5-fold cross validation,

integer, to specify the number of folds in a (Stratified)KFold,

CV splitter,

An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is

either binary or multiclass, StratifiedKFold is used. In all

other cases, KFold is used. These splitters are instantiated

with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various

cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
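As a sketch of passing a CV splitter instead of an integer, the example below uses a shuffled StratifiedKFold. The dataset, the grid and the random_state value are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# An explicit splitter gives control over shuffling, unlike cv=3.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid={"C": [1, 10]}, cv=cv)
search.fit(X, y)

print(search.n_splits_)
```

Fixing random_state on the splitter keeps the folds identical across candidates and across runs.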

verbose : int

Controls the verbosity: the higher, the more messages.

>1 : the computation time for each fold and parameter candidate is

displayed;

>2 : the score is also displayed;

>3 : the fold and candidate parameter indexes are also displayed

together with the starting time of the computation.

pre_dispatch : int, or str, default='2*n_jobs'

Controls the number of jobs that get dispatched during parallel

execution. Reducing this number can be useful to avoid an

explosion of memory consumption when more jobs get dispatched

than CPUs can process. This parameter can be:

None, in which case all the jobs are immediately

created and spawned. Use this for lightweight and

fast-running jobs, to avoid delays due to on-demand

spawning of the jobs

An int, giving the exact number of total jobs that are

spawned

A str, giving an expression as a function of n_jobs,

as in ‘2*n_jobs’

error_score : 'raise' or numeric, default=np.nan

Value to assign to the score if an error occurs in estimator fitting.

If set to ‘raise’, the error is raised. If a numeric value is given,

FitFailedWarning is raised. This parameter does not affect the refit

step, which will always raise the error.

return_train_score : bool, default=False

If False, the cv_results_ attribute will not include training scores.

Computing training scores is used to get insights on how different

parameter settings impact the overfitting/underfitting trade-off.

However computing the scores on the training set can be computationally

expensive and is not strictly required to select the parameters that

yield the best generalization performance.

New in version 0.19.

Changed in version 0.21: Default value was changed from True to False

Attributes:

cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

For instance, the table below

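| param_kernel | param_gamma | param_degree | split0_test_score | … | rank_t… |
|---|---|---|---|---|---|
| 'poly' | – | 2 | 0.80 | … | 2 |
| 'poly' | – | 3 | 0.70 | … | 4 |
| 'rbf' | 0.1 | – | 0.80 | … | 3 |
| 'rbf' | 0.2 | – | 0.93 | … | 1 |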

will be represented by a cv_results_ dict of:

{

'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],

mask = [False False False False]...)

'param_gamma': masked_array(data = [-- -- 0.1 0.2],

mask = [ True True False False]...),

'param_degree': masked_array(data = [2.0 3.0 -- --],

mask = [False False True True]...),

'split0_test_score' : [0.80, 0.70, 0.80, 0.93],

'split1_test_score' : [0.82, 0.50, 0.70, 0.78],

'mean_test_score' : [0.81, 0.60, 0.75, 0.85],

'std_test_score' : [0.01, 0.10, 0.05, 0.08],

'rank_test_score' : [2, 4, 3, 1],

'split0_train_score' : [0.80, 0.92, 0.70, 0.93],

'split1_train_score' : [0.82, 0.55, 0.70, 0.87],

'mean_train_score' : [0.81, 0.74, 0.70, 0.90],

'std_train_score' : [0.01, 0.19, 0.00, 0.03],

'mean_fit_time' : [0.73, 0.63, 0.43, 0.49],

'std_fit_time' : [0.01, 0.02, 0.01, 0.01],

'mean_score_time' : [0.01, 0.06, 0.04, 0.04],

'std_score_time' : [0.00, 0.00, 0.00, 0.01],

'params' : [{'kernel': 'poly', 'degree': 2}, ...],

}

NOTE

The key 'params' is used to store a list of parameter

settings dicts for all the parameter candidates.

The mean_fit_time, std_fit_time, mean_score_time and

std_score_time are all in seconds.

For multi-metric evaluation, the scores for all the scorers are

available in the cv_results_ dict at the keys ending with that

scorer’s name ('_<scorer_name>') instead of the '_score' suffix shown

above (for example, 'split0_test_precision', 'mean_train_precision').
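A short sketch of loading cv_results_ into a pandas DataFrame, assuming pandas is installed. The estimator and grid mirror the style of the reference example but are otherwise arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]})
search.fit(X, y)

# One row per parameter candidate, one column per cv_results_ key.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_kernel", "param_C", "mean_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```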

best_estimator_ : estimator

Estimator that was chosen by the search, i.e. estimator

which gave highest score (or smallest loss if specified)

on the left out data. Not available if refit=False.

See refit parameter for more information on allowed values.

best_score_ : float

Mean cross-validated score of the best_estimator.

For multi-metric evaluation, this is present only if refit is

specified.

This attribute is not available if refit is a function.

best_params_ : dict

Parameter setting that gave the best results on the hold out data.

For multi-metric evaluation, this is present only if refit is

specified.

best_index_ : int

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives

the parameter setting for the best model, that gives the highest

mean score (search.best_score_).

For multi-metric evaluation, this is present only if refit is

specified.
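The relationship between best_index_, best_params_ and best_score_ can be verified directly. The sketch below uses an arbitrary small grid on the iris data, chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]})
search.fit(X, y)

# best_index_ points into every cv_results_ array.
best_from_results = search.cv_results_["params"][search.best_index_]
score_from_results = search.cv_results_["mean_test_score"][search.best_index_]

print(best_from_results, score_from_results)
```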

scorer_ : function or a dict

Scorer function used on the held out data to choose the best

parameters for the model.

For multi-metric evaluation, this attribute holds the validated

scoring dict which maps the scorer key to the scorer callable.

n_splits_ : int

The number of cross-validation splits (folds/iterations).

refit_time_ : float

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.

New in version 0.20.

multimetric_ : bool

Whether or not the scorers compute several metrics.

classes_ : ndarray of shape (n_classes,)

Class labels.

n_features_in_ : int

Number of features seen during fit.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes feature_names_in_ when fit.

New in version 1.0.

See also

ParameterGrid : Generates all the combinations of a hyperparameter grid.

train_test_split : Utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.

sklearn.metrics.make_scorer : Make a scorer from a performance metric or loss function.

Notes

The parameters selected are those that maximize the score of the left out

data, unless an explicit score is passed in which case it is used instead.

If n_jobs was set to a value higher than one, the data is copied for each

point in the grid (and not n_jobs times). This is done for efficiency

reasons if individual jobs take very little time, but may raise errors if

the dataset is large and not enough memory is available. A workaround in

this case is to set pre_dispatch. Then, the memory is copied only

pre_dispatch many times. A reasonable value for pre_dispatch is 2 *

n_jobs.

Examples

>>> from sklearn import svm, datasets

>>> from sklearn.model_selection import GridSearchCV

>>> iris = datasets.load_iris()

>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

>>> svc = svm.SVC()

>>> clf = GridSearchCV(svc, parameters)

>>> clf.fit(iris.data, iris.target)

GridSearchCV(estimator=SVC(),

param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})

>>> sorted(clf.cv_results_.keys())

['mean_fit_time', 'mean_score_time', 'mean_test_score',...

'param_C', 'param_kernel', 'params',...

'rank_test_score', 'split0_test_score',...

'split2_test_score', ...

'std_fit_time', 'std_score_time', 'std_test_score']

Methods

decision_function(X)

Call decision_function on the estimator with the best found parameters.

fit(X[, y])

Run fit with all sets of parameters.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

inverse_transform(Xt)

Call inverse_transform on the estimator with the best found params.

predict(X)

Call predict on the estimator with the best found parameters.

predict_log_proba(X)

Call predict_log_proba on the estimator with the best found parameters.

predict_proba(X)

Call predict_proba on the estimator with the best found parameters.

score(X[, y])

Return the score on the given data, if the estimator has been refit.

score_samples(X)

Call score_samples on the estimator with the best found parameters.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Call transform on the estimator with the best found parameters.

property classes_

Class labels.

Only available when refit=True and the estimator is a classifier.

decision_function(X)

Call decision_function on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports

decision_function.

Parameters:

X : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_score : ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)

Result of the decision function for X based on the estimator with the best found parameters.

fit(X, y=None, **params)

Run fit with all sets of parameters.

Parameters:

X : array-like of shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples, n_output) or (n_samples,), default=None

Target relative to X for classification or regression; None for unsupervised learning.

**params : dict of str -> object

Parameters passed to the fit method of the estimator, the scorer, and the CV splitter.

If a fit parameter is an array-like whose length is equal to n_samples, then it will be split across CV groups along with X and y. For example, the sample_weight parameter is split because len(sample_weight) == len(X).

Returns:

self : object

Instance of fitted estimator.
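A minimal sketch of passing a per-sample fit parameter, assuming the default configuration in which metadata routing is disabled. The weights are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical weights: samples of class 2 count twice as much.
weights = np.ones(len(X))
weights[y == 2] = 2.0

search = GridSearchCV(SVC(), param_grid={"C": [1, 10]}, cv=3)

# sample_weight has one entry per sample, so it is sliced along with X and y
# for every training fold before being handed to SVC.fit.
search.fit(X, y, sample_weight=weights)

print(search.best_params_)
```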

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

New in version 1.4.

Returns:

routing : MetadataRouter

A MetadataRouter encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.

inverse_transform(Xt)

Call inverse_transform on the estimator with the best found params.

Only available if the underlying estimator implements

inverse_transform and refit=True.

Parameters:

Xt : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

X : {ndarray, sparse matrix} of shape (n_samples, n_features)

Result of the inverse_transform function for Xt based on the estimator with the best found parameters.

property n_features_in_

Number of features seen during fit.

Only available when refit=True.

predict(X)

Call predict on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports

predict.

Parameters:

X : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred : ndarray of shape (n_samples,)

The predicted labels or values for X based on the estimator with the best found parameters.

predict_log_proba(X)

Call predict_log_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports

predict_log_proba.

Parameters:

X : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)

Predicted class log-probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

predict_proba(X)

Call predict_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports

predict_proba.

Parameters:

X : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)

Predicted class probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

score(X, y=None, **params)

Return the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the

best_estimator_.score method otherwise.

Parameters:

X : array-like of shape (n_samples, n_features)

Input data, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples, n_output) or (n_samples,), default=None

Target relative to X for classification or regression; None for unsupervised learning.

**params : dict

Parameters to be passed to the underlying scorer(s).

New in version 1.4: Only available if enable_metadata_routing=True. See the Metadata Routing User Guide for more details.

Returns:

score : float

The score defined by scoring if provided, and the best_estimator_.score method otherwise.
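To illustrate that score follows the scoring parameter rather than the estimator's default metric, the sketch below compares it with a direct call to the metric. The choice of the "f1_macro" scorer, the dataset and the grid are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid={"C": [1, 10]}, scoring="f1_macro")
search.fit(X, y)

# score applies the configured scorer to the refit best estimator ...
search_score = search.score(X, y)

# ... which matches computing the same metric on its predictions.
manual_score = f1_score(y, search.predict(X), average="macro")

print(search_score, manual_score)
```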

score_samples(X)

Call score_samples on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports

score_samples.

New in version 0.24.

Parameters:

X : iterable

Data to predict on. Must fulfill input requirements of the underlying estimator.

Returns:

y_score : ndarray of shape (n_samples,)

The best_estimator_.score_samples method.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects

(such as Pipeline). The latter have

parameters of the form <component>__<parameter> so that it’s

possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.

transform(X)

Call transform on the estimator with the best found parameters.

Only available if the underlying estimator supports transform and

refit=True.

Parameters:

X : indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

Xt : {ndarray, sparse matrix} of shape (n_samples, n_features)

X transformed in the new space based on the estimator with the best found parameters.

Examples using sklearn.model_selection.GridSearchCV

Release Highlights for scikit-learn 1.4

Release Highlights for scikit-learn 0.24

Feature agglomeration vs. univariate selection

Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood

Model selection with Probabilistic PCA and Factor Analysis (FA)

Comparing Random Forests and Histogram Gradient Boosting models

Gaussian Mixture Model Selection

Comparison of kernel ridge regression and SVR

Displaying Pipelines

Balance model complexity and cross-validated score

Comparing randomized search and grid search for hyperparameter estimation

Comparison between grid search and successive halving

Custom refit strategy of a grid search with cross-validation

Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV

Nested versus non-nested cross-validation

Sample pipeline for text feature extraction and evaluation

Statistical comparison of models using grid search

Overview of multiclass training meta-estimators

Caching nearest neighbors

Kernel Density Estimation

Column Transformer with Mixed Types

Concatenating multiple feature extraction methods

Pipelining: chaining a PCA and a logistic regression

Selecting dimensionality reduction with Pipeline and GridSearchCV

Feature discretization

Plot classification boundaries with different SVM Kernels

RBF SVM parameters

Cross-validation on diabetes Dataset Exercise


2.3. Clustering — scikit-learn 1.4.1 documentation


2.3. Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster.

Each clustering algorithm comes in two variants: a class, that implements

the fit method to learn the clusters on train data, and a function,

that, given train data, returns an array of integer labels corresponding

to the different clusters. For the class, the labels over the training

data can be found in the labels_ attribute.
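As a sketch of the two variants, the example below runs K-means once through the KMeans class and once through the k_means function. The toy data is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Toy data (illustrative only): three points near x=1, three near x=10.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Class variant: fit, then read the training labels from labels_.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: returns the centroids, the labels and the inertia.
centers, function_labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)

print(class_labels, function_labels)
```

The class keeps the fitted state around, which is what makes later calls such as predict possible; the function is convenient when only the labels are needed.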

Input data

One important thing to note is that the algorithms implemented in

this module can take different kinds of matrix as input. All the

methods accept standard data matrices of shape (n_samples, n_features).

These can be obtained from the classes in the sklearn.feature_extraction

module. For AffinityPropagation, SpectralClustering

and DBSCAN one can also input similarity matrices of shape

(n_samples, n_samples). These can be obtained from the functions

in the sklearn.metrics.pairwise module.
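The following sketch feeds a precomputed (n_samples, n_samples) matrix to DBSCAN. Note that with metric="precomputed" DBSCAN expects pairwise distances rather than similarities; the toy data and the eps and min_samples values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances

# Toy data (illustrative only): two groups of three points.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Square matrix of Euclidean distances between all pairs of samples.
D = pairwise_distances(X)

# The estimator never sees X, only the pairwise matrix.
labels = DBSCAN(eps=2.5, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)
```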

2.3.1. Overview of clustering methods

A comparison of the clustering algorithms in scikit-learn

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, transductive | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances, transductive | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points |
| HDBSCAN | minimum cluster membership, minimum point neighbors | Large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation, inductive | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Very large n_samples, medium n_clusters | General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical | Distances between points |

Non-flat geometry clustering is useful when the clusters have a specific

shape, i.e. a non-flat manifold, and the standard euclidean distance is

not the right metric. This case arises in the two top rows of the figure

above.

Gaussian mixture models, useful for clustering, are described in

another chapter of the documentation dedicated to

mixture models. KMeans can be seen as a special case of Gaussian mixture

model with equal covariance per component.

Transductive clustering methods (in contrast to

inductive clustering methods) are not designed to be applied to new,

unseen data.
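
One practical way to see the distinction, assuming that the presence of a predict method is a fair proxy for "inductive", is to check which estimators expose it. The sketch below only inspects the estimator objects; it does not fit anything.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Inductive estimators expose predict(); transductive ones only label
# the data they were fitted on (through fit_predict or labels_).
for estimator in (KMeans(n_clusters=2), DBSCAN(), AgglomerativeClustering()):
    name = type(estimator).__name__
    print(f"{name}: has predict = {hasattr(estimator, 'predict')}")
```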

2.3.2. K-means¶

The KMeans algorithm clusters data by trying to separate samples in n

groups of equal variance, minimizing a criterion known as the inertia or

within-cluster sum-of-squares (see below). This algorithm requires the number

of clusters to be specified. It scales well to large numbers of samples and has

been used across a large range of application areas in many different fields.

The k-means algorithm divides a set of $N$ samples $X$ into

$K$ disjoint clusters $C$, each described by the mean $\mu_j$

of the samples in the cluster. The means are commonly called the cluster

“centroids”; note that they are not, in general, points from $X$,

although they live in the same space.

The K-means algorithm aims to choose centroids that minimise the inertia,

or within-cluster sum-of-squares criterion:

\[\sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)\]

Inertia can be recognized as a measure of how internally coherent clusters are.

It suffers from various drawbacks:

Inertia makes the assumption that clusters are convex and isotropic,

which is not always the case. It responds poorly to elongated clusters,

or manifolds with irregular shapes.

Inertia is not a normalized metric: we just know that lower values are

better and zero is optimal. But in very high-dimensional spaces, Euclidean

distances tend to become inflated

(this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to

k-means clustering can alleviate this problem and speed up the

computations.

For more detailed descriptions of the issues shown above and how to address them,

refer to the examples Demonstration of k-means assumptions

and Selecting the number of clusters with silhouette analysis on KMeans clustering.
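
As a quick check of the inertia formula above, the following sketch recomputes the within-cluster sum-of-squares with NumPy and compares it with the inertia_ attribute of a fitted KMeans. The synthetic data and the choice of three clusters are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic data, for illustration only

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every sample to every centroid,
# then keep the smallest value per sample and sum over samples.
sq_dists = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point error.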

K-means is often referred to as Lloyd’s algorithm. In basic terms, the

algorithm has three steps. The first step chooses the initial centroids, with

the most basic method being to choose $k$ samples from the dataset

$X$. After initialization, K-means consists of looping between the

two other steps. The first step assigns each sample to its nearest centroid.

The second step creates new centroids by taking the mean value of all of the

samples assigned to each previous centroid. The difference between the old

and the new centroids is computed, and the algorithm repeats these last two

steps until this value is less than a threshold. In other words, it repeats

until the centroids do not move significantly.
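
The three steps can be sketched in plain NumPy as follows. This is a minimal illustration, not the scikit-learn implementation: the function name lloyd_kmeans, the handling of empty clusters, and the use of the summed squared centroid shift as the stopping value are assumptions made here.

```python
import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    """Plain Lloyd iterations: assign, update, stop when centroids settle."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        shift = np.sum((new_centroids - centroids) ** 2)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment, consistent with the returned centroids.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, sq_dists.argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(8.0, 0.5, size=(50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```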

K-means is equivalent to the expectation-maximization algorithm

with a small, all-equal, diagonal covariance matrix.

The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of

the points is calculated using the current centroids. Each segment in the

Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated

to the mean of each segment. The algorithm then repeats this until a stopping

criterion is fulfilled. Usually, the algorithm stops when the relative decrease

in the objective function between iterations is less than the given tolerance

value. This is not the case in this implementation: iteration stops when

centroids move less than the tolerance.

Given enough time, K-means will always converge; however, this may be to a local

minimum. This is highly dependent on the initialization of the centroids.

As a result, the computation is often done several times, with different

initializations of the centroids. One method to help address this issue is the

k-means++ initialization scheme, which has been implemented in scikit-learn

(use the init='k-means++' parameter). This initializes the centroids to be

(generally) distant from each other, leading to probably better results than

random initialization, as shown in the reference. For a detailed example of

comparing different initialization schemes, refer to

A demo of K-Means clustering on the handwritten digits data.

K-means++ can also be called independently to select seeds for other

clustering algorithms, see sklearn.cluster.kmeans_plusplus for details

and example usage.
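
A minimal usage sketch of kmeans_plusplus on synthetic data (the data and the number of seeds are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # synthetic data, for illustration only

# Returns the chosen seed points and their row indices in X.
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)
print(centers.shape)
print(indices)
```

The returned seeds are rows of X, so they can be passed as initial centers to another algorithm.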

The algorithm supports sample weights, which can be given by a parameter

sample_weight. This allows assigning more weight to some samples when

computing cluster centers and values of inertia. For example, assigning a

weight of 2 to a sample is equivalent to adding a duplicate of that sample

to the dataset $X$.
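
The equivalence can be checked on a toy dataset. The sketch below assumes a fixed, hand-picked initialization so that both runs start from the same centroids; the data points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = np.array([[0.0, 0.0], [10.0, 10.0]])  # fixed start, so the runs are comparable

# Option A: give the first sample a weight of 2.
weighted = KMeans(n_clusters=2, init=init, n_init=1).fit(
    X, sample_weight=[2, 1, 1, 1]
)

# Option B: append a duplicate of the first sample instead.
X_dup = np.vstack([X, X[:1]])
duplicated = KMeans(n_clusters=2, init=init, n_init=1).fit(X_dup)

print(weighted.cluster_centers_)
print(duplicated.cluster_centers_)
```

In both cases the first centroid is the weighted mean (0, 1/3) rather than the unweighted mean (0, 1/2).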

K-means can be used for vector quantization. This is achieved using the

transform method of a trained model of KMeans. For an example of

performing vector quantization on an image refer to

Color Quantization using K-Means.

Examples:

K-means Clustering: Example usage of

KMeans using the iris dataset

Clustering text documents using k-means: Document clustering

using KMeans and MiniBatchKMeans based on sparse data

2.3.2.1. Low-level parallelism¶

KMeans benefits from OpenMP based parallelism through Cython. Small

chunks of data (256 samples) are processed in parallel, which in addition

yields a low memory footprint. For more details on how to control the number of

threads, please refer to our Parallelism notes.

Examples:

Demonstration of k-means assumptions: Demonstrating when

k-means performs intuitively and when it does not

A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits

References:

“k-means++: The advantages of careful seeding”

Arthur, David, and Sergei Vassilvitskii,

Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete

algorithms, Society for Industrial and Applied Mathematics (2007)

2.3.2.2. Mini Batch K-Means¶

The MiniBatchKMeans is a variant of the KMeans algorithm

which uses mini-batches to reduce the computation time, while still attempting

to optimise the same objective function. Mini-batches are subsets of the input

data, randomly sampled in each training iteration. These mini-batches

drastically reduce the amount of computation required to converge to a local

solution. In contrast to other algorithms that reduce the convergence time of

k-means, mini-batch k-means produces results that are generally only slightly

worse than the standard algorithm.

The algorithm iterates between two major steps, similar to vanilla k-means.

In the first step, $b$ samples are drawn randomly from the dataset, to form

a mini-batch. These are then assigned to the nearest centroid. In the second

step, the centroids are updated. In contrast to k-means, this is done on a

per-sample basis. For each sample in the mini-batch, the assigned centroid

is updated by taking the streaming average of the sample and all previous

samples assigned to that centroid. This has the effect of decreasing the

rate of change for a centroid over time. These steps are performed until

convergence or a predetermined number of iterations is reached.
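
The per-sample streaming average can be sketched as below. This follows the description above in plain NumPy and is not scikit-learn's MiniBatchKMeans implementation; the function name minibatch_kmeans, the fixed number of iterations, and the 1/count step size are assumptions of this sketch.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=32, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of samples each centroid has absorbed so far
    for _ in range(n_iter):
        # Step 1: draw a mini-batch and assign it to the nearest centroids.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        sq_dists = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dists.argmin(axis=1)
        # Step 2: per-sample streaming-average update of the assigned centroid.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the centroid sees more samples
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in (0.0, 5.0, 10.0)])
print(minibatch_kmeans(X, k=3))
```

Because the step size is the inverse of the per-centroid count, each centroid is the running mean of every sample ever assigned to it, which is why its rate of change decreases over time.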

MiniBatchKMeans converges faster than KMeans, but the quality

of the results is reduced. In practice this difference in quality can be quite

small, as shown in the example and cited reference.

Examples:

Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of

KMeans and MiniBatchKMeans

Clustering text documents using k-means: Document clustering

using KMeans and MiniBatchKMeans based on sparse data

Online learning of a dictionary of parts of faces

References:

“Web Scale K-Means clustering”

D. Sculley, Proceedings of the 19th international conference on World

wide web (2010)

2.3.3. Affinity Propagation¶

AffinityPropagation creates clusters by sending messages between

pairs of samples until convergence. A dataset is then described using a small

number of exemplars, which are identified as those most representative of other

samples. The messages sent between pairs represent the suitability for one

sample to be the exemplar of the other, which is updated in response to the

values from other pairs. This updating happens iteratively until convergence,

at which point the final exemplars are chosen, and hence the final clustering

is given.

Affinity Propagation can be interesting as it chooses the number of

clusters based on the data provided. For this purpose, the two important

parameters are the preference, which controls how many exemplars are

used, and the damping factor which damps the responsibility and

availability messages to avoid numerical oscillations when updating these

messages.
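
A short usage sketch showing both parameters. The synthetic data and the values preference=-50 and damping=0.9 are arbitrary choices for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Three loose groups of points in 2D (synthetic data, for illustration only).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

# Lower preference values tend to produce fewer exemplars;
# damping is expected to lie in [0.5, 1).
ap = AffinityPropagation(preference=-50, damping=0.9, random_state=0).fit(X)

print(len(ap.cluster_centers_indices_), "exemplars")
print(ap.labels_[:10])
```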

The main drawback of Affinity Propagation is its complexity. The

algorithm has a time complexity of the order $O(N^2 T)$, where $N$

is the number of samples and $T$ is the number of iterations until

convergence. Further, the memory complexity is of the order

$O(N^2)$ if a dense similarity matrix is used, but reducible if a

sparse similarity matrix is used. This makes Affinity Propagation most

appropriate for small to medium sized datasets.

Examples:

Demo of affinity propagation clustering algorithm: Affinity

Propagation on a synthetic 2D dataset with 3 classes.

Visualizing the stock market structure: Affinity Propagation on

financial time series to find groups of companies.

Algorithm description:

The messages sent between points belong to one of two categories. The first is

the responsibility $r(i, k)$,

which is the accumulated evidence that sample $k$

should be the exemplar for sample $i$.

The second is the availability $a(i, k)$

which is the accumulated evidence that sample $i$

should choose sample $k$ to be its exemplar,

and considers the values for all other samples that $k$ should

be an exemplar. In this way, exemplars are chosen by samples if they are (1)

similar enough to many samples and (2) chosen by many samples to be

representative of themselves.

More formally, the responsibility of a sample $k$

to be the exemplar of sample $i$ is given by:

\[r(i, k) \leftarrow s(i, k) - max [ a(i, k') + s(i, k') \forall k' \neq k ]\]

Where $s(i, k)$ is the similarity between samples $i$ and $k$.

The availability of sample $k$

to be the exemplar of sample $i$ is given by:

\[a(i, k) \leftarrow min [0, r(k, k) + \sum_{i'~s.t.~i' \notin \{i, k\}}{r(i', k)}]\]

To begin with, all values for $r$ and $a$ are set to zero,

and the calculation of each iterates until convergence.

As discussed above, in order to avoid numerical oscillations when updating the

messages, the damping factor $\lambda$ is introduced into the iteration process:

\[r_{t+1}(i, k) = \lambda\cdot r_{t}(i, k) + (1-\lambda)\cdot r_{t+1}(i, k)\]

\[a_{t+1}(i, k) = \lambda\cdot a_{t}(i, k) + (1-\lambda)\cdot a_{t+1}(i, k)\]

where $t$ is the iteration index.
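
To make the responsibility formula and the damping step concrete, here is a NumPy sketch of a single damped responsibility update. It follows the equations above literally and is not the scikit-learn implementation; the function name and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def damped_responsibility_update(S, A, R_old, damping=0.5):
    """One update of r(i, k) for all pairs, followed by damping."""
    n = S.shape[0]
    rows = np.arange(n)
    AS = A + S
    # Largest and second-largest value of a(i, k') + s(i, k') in each row.
    best = AS.argmax(axis=1)
    first = AS[rows, best]
    AS_without_best = AS.copy()
    AS_without_best[rows, best] = -np.inf
    second = AS_without_best.max(axis=1)
    # The max over k' != k equals `first`, except at k == best where it is `second`.
    max_other = np.tile(first[:, None], (1, n))
    max_other[rows, best] = second
    R_new = S - max_other
    # Damping: mix the previous messages with the freshly computed ones.
    return damping * R_old + (1.0 - damping) * R_new

rng = np.random.default_rng(0)
S = -rng.random((5, 5))      # similarities (made up for illustration)
A = np.zeros((5, 5))         # availabilities start at zero
R = np.zeros((5, 5))         # responsibilities start at zero
print(damped_responsibility_update(S, A, R, damping=0.7))
```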

2.3.4. Mean Shift¶

MeanShift clustering aims to discover blobs in a smooth density of

samples. It is a centroid based algorithm, which works by updating candidates

for centroids to be the mean of the points within a given region. These

candidates are then filtered in a post-processing stage to eliminate

near-duplicates to form the final set of centroids.

The position of centroid candidates is iteratively adjusted using a technique called hill

climbing, which finds local maxima of the estimated probability density.

Given a candidate centroid $x$ for iteration $t$, the candidate

is updated according to the following equation:

\[x^{t+1} = x^t + m(x^t)\]

Where $m$ is the mean shift vector that is computed for each

centroid that points towards a region of the maximum increase in the density of points.

To compute $m$ we define $N(x)$ as the neighborhood of samples within

a given distance around $x$. Then $m$ is computed using the following

equation, effectively updating a centroid to be the mean of the samples within

its neighborhood:

\[m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x\]

In general, the equation for $m$ depends on a kernel used for density estimation.

The generic formula is:

\[m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x\]

In our implementation, $K(x)$ is equal to 1 if $x$ is small enough and is

equal to 0 otherwise. Effectively $K(y - x)$ indicates whether $y$ is in

the neighborhood of $x$.
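
With this flat kernel, one hill-climbing step reduces to moving the candidate to the mean of its neighborhood. The sketch below illustrates that update in NumPy; the function names, the stopping tolerance, and the toy data are assumptions of this example and not part of scikit-learn.

```python
import numpy as np

def mean_shift_step(x, X, bandwidth):
    """Return x + m(x) for a flat kernel of radius `bandwidth`."""
    # N(x): samples within `bandwidth` of the current candidate.
    in_neighborhood = np.linalg.norm(X - x, axis=1) <= bandwidth
    neighbors = X[in_neighborhood]
    if len(neighbors) == 0:
        return x  # no neighbors: leave the candidate where it is
    m = neighbors.mean(axis=0) - x  # mean shift vector
    return x + m

def climb(x, X, bandwidth, tol=1e-6, max_iter=300):
    """Repeat the update until the candidate stops moving."""
    for _ in range(max_iter):
        x_next = mean_shift_step(x, X, bandwidth)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
print(climb(np.array([0.2, 0.2]), X, bandwidth=2.0))
```

Starting near the unit square, the candidate settles at the center of the four nearby points, (0.5, 0.5), and ignores the distant point.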

The algorithm automatically sets the number of clusters, relying instead on a

parameter bandwidth, which dictates the size of the region to search through.

This parameter can be set manually, or it can be estimated using the provided

estimate_bandwidth function, which is called if the bandwidth is not set.

The algorithm is not highly scalable, as it requires multiple nearest neighbor

searches during the execution of the algorithm. The algorithm is guaranteed to

converge, however the algorithm will stop iterating when the change in centroids

is small.

Labelling a new sample is performed by finding the nearest centroid for a

given sample.
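
A brief usage sketch, assuming synthetic two-blob data and an arbitrary quantile for the bandwidth estimate:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0.0, 5.0)])

# Estimate the bandwidth from the data, then fit.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)
print("clusters found:", len(ms.cluster_centers_))

# A new sample receives the label of its nearest centroid.
new_points = np.array([[0.1, -0.2], [4.8, 5.3]])
print(ms.predict(new_points))
```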

Examples:

A demo of the mean-shift clustering algorithm: Mean Shift clustering

on a synthetic 2D dataset with 3 classes.

References:

“Mean shift: A robust approach toward feature space analysis”

D. Comaniciu and P. Meer, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)

2.3.5. Spectral clustering¶

SpectralClustering performs a low-dimension embedding of the

affinity matrix between samples, followed by clustering, e.g., by KMeans,

of the components of the eigenvectors in the low dimensional space.

It is especially computationally efficient if the affinity matrix is sparse

and the amg solver is used for the eigenvalue problem (Note, the amg solver

requires that the pyamg module is installed.)

The present version of SpectralClustering requires the number of clusters

to be specified in advance. It works well for a small number of clusters,

but is not advised for many clusters.

For two clusters, SpectralClustering solves a convex relaxation of the

normalized cuts

problem on the similarity graph: cutting the graph in two so that the weight of

the edges cut is small compared to the weights of the edges inside each

cluster. This criterion is especially interesting when working on images, where

graph vertices are pixels, and weights of the edges of the similarity graph are

computed using a function of a gradient of the image.

Warning

Transforming distance to well-behaved similarities

Note that if the values of your similarity matrix are not well

distributed, e.g. with negative values or with a distance matrix

rather than a similarity, the spectral problem will be singular and

the problem not solvable. In which case it is advised to apply a

transformation to the entries of the matrix. For instance, in the

case of a signed distance matrix, it is common to apply a heat kernel:

similarity = np.exp(-beta * distance / distance.std())

See the examples for such an application.
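
A self-contained sketch of this transformation, assuming a Euclidean distance matrix on synthetic data. The value of beta is a hypothetical choice; it needs tuning for real data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in (0.0, 10.0)])

distance = pairwise_distances(X)  # Euclidean distance matrix
beta = 5.0  # hypothetical value; larger beta makes similarities decay faster
similarity = np.exp(-beta * distance / distance.std())

sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(similarity)
print(labels)
```

The resulting similarities lie in (0, 1], with 1 on the diagonal, which is the kind of well-behaved input the precomputed affinity expects.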

Examples:

Spectral clustering for image segmentation: Segmenting objects

from a noisy background using spectral clustering.

Segmenting the picture of greek coins in regions: Spectral clustering

to split the image of coins in regions.

2.3.5.1. Different label assignment strategies¶

Different label assignment strategies can be used, corresponding to the

assign_labels parameter of SpectralClustering.

"kmeans" strategy can match finer details, but can be unstable.

In particular, unless you control the random_state, it may not be

reproducible from run-to-run, as it depends on random initialization.

The alternative "discretize" strategy is 100% reproducible, but tends

to create parcels of fairly even and geometrical shape.

The recently added "cluster_qr" option is a deterministic alternative that

tends to create the visually best partitioning on the example application

below.

Figure panels (not reproduced here): assign_labels="kmeans", assign_labels="discretize", assign_labels="cluster_qr"

References:

“Multiclass spectral clustering”

Stella X. Yu, Jianbo Shi, 2003

“Simple, direct, and efficient multi-way spectral clustering”

Anil Damle, Victor Minden, Lexing Ying, 2019

2.3.5.2. Spectral Clustering Graphs¶

Spectral Clustering can also be used to partition graphs via their spectral

embeddings. In this case, the affinity matrix is the adjacency matrix of the

graph, and SpectralClustering is initialized with affinity='precomputed':

>>> from sklearn.cluster import SpectralClustering

>>> sc = SpectralClustering(3, affinity='precomputed', n_init=100,

...                         assign_labels='discretize')

>>> sc.fit_predict(adjacency_matrix)
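
The snippet above leaves adjacency_matrix undefined. A self-contained variant, assuming a toy graph made of two 4-node cliques joined by a single edge, could look like this:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Adjacency matrix of a toy graph: two 4-node cliques joined by the edge (3, 4).
adjacency_matrix = np.zeros((8, 8))
adjacency_matrix[:4, :4] = 1
adjacency_matrix[4:, 4:] = 1
np.fill_diagonal(adjacency_matrix, 0)
adjacency_matrix[3, 4] = adjacency_matrix[4, 3] = 1

sc = SpectralClustering(2, affinity="precomputed",
                        assign_labels="discretize", random_state=0)
labels = sc.fit_predict(adjacency_matrix)
print(labels)
```

The two cliques are expected to end up in different clusters, since cutting the single bridging edge is by far the cheapest cut.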

References:

“A Tutorial on Spectral Clustering”

Ulrike von Luxburg, 2007

“Normalized cuts and image segmentation”

Jianbo Shi, Jitendra Malik, 2000

“A Random Walks View of Spectral Segmentation”

Marina Meila, Jianbo Shi, 2001

“On Spectral Clustering: Analysis and an algorithm”

Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001

“Preconditioned Spectral Clustering for Stochastic

Block Partition Streaming Graph Challenge”

David Zhuzhunashvili, Andrew Knyazev

2.3.6. Hierarchical clustering¶

Hierarchical clustering is a general family of clustering algorithms that

build nested clusters by merging or splitting them successively. This

hierarchy of clusters is represented as a tree (or dendrogram). The root of the

tree is the unique cluster that gathers all the samples, the leaves being the

clusters with only one sample. See the Wikipedia page for more details.

The AgglomerativeClustering object performs a hierarchical clustering

using a bottom up approach: each observation starts in its own cluster, and

clusters are successively merged together. The linkage criterion determines the

metric used for the merge strategy:

Ward minimizes the sum of squared differences within all clusters. It is a

variance-minimizing approach and in this sense is similar to the k-means

objective function but tackled with an agglomerative hierarchical

approach.

Maximum or complete linkage minimizes the maximum distance between

observations of pairs of clusters.

Average linkage minimizes the average of the distances between all

observations of pairs of clusters.

Single linkage minimizes the distance between the closest

observations of pairs of clusters.

AgglomerativeClustering can also scale to a large number of samples

when it is used jointly with a connectivity matrix, but is computationally

expensive when no connectivity constraints are added between samples: it

considers at each step all the possible merges.

FeatureAgglomeration

The FeatureAgglomeration uses agglomerative clustering to

group together features that look very similar, thus decreasing the

number of features. It is a dimensionality reduction tool, see

Unsupervised dimensionality reduction.

2.3.6.1. Different linkage type: Ward, complete, average, and single linkage¶

AgglomerativeClustering supports Ward, single, average, and complete

linkage strategies.

Agglomerative clustering has a “rich get richer” behavior that leads to

uneven cluster sizes. In this regard, single linkage is the worst

strategy, and Ward gives the most regular sizes. However, the affinity

(or distance used in clustering) cannot be varied with Ward, thus for non

Euclidean metrics, average linkage is a good alternative. Single linkage,

while not robust to noisy data, can be computed very efficiently and can

therefore be useful to provide hierarchical clustering of larger datasets.

Single linkage can also perform well on non-globular data.
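
To compare the cluster sizes produced by each linkage on your own data, a loop such as the following can be used. The unstructured synthetic data and the choice of four clusters are assumptions made for illustration; the sizes depend on the data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))  # synthetic data, for illustration only

sizes_by_linkage = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit(X)
    sizes = np.bincount(model.labels_, minlength=4)
    sizes_by_linkage[linkage] = sorted(sizes.tolist(), reverse=True)
    print(f"{linkage:>8}: cluster sizes = {sizes_by_linkage[linkage]}")
```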

Examples:

Various Agglomerative Clustering on a 2D embedding of digits: exploration of the

different linkage strategies in a real dataset.

2.3.6.2. Visualization of cluster hierarchy¶

It’s possible to visualize the tree representing the hierarchical merging of clusters

as a dendrogram. Visual inspection can often be useful for understanding the structure

of the data, though more so in the case of small sample sizes.
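
One way to obtain a dendrogram is to convert the fitted tree into the linkage-matrix layout used by SciPy. The sketch below assumes SciPy is available and uses no_plot=True so that it runs without matplotlib; the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # small synthetic dataset

# distance_threshold=0 with n_clusters=None builds the full tree and
# makes the merge distances available in `distances_`.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

# Count the samples under each internal node of the tree.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child in merge:
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

# SciPy linkage-matrix layout: [child_a, child_b, distance, sample_count].
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# no_plot=True returns the layout only; drop it to draw with matplotlib.
tree = dendrogram(linkage_matrix, no_plot=True)
print(tree["ivl"])  # leaf order along the horizontal axis
```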

2.3.6.3. Adding connectivity constraints¶

An interesting aspect of AgglomerativeClustering is that

connectivity constraints can be added to this algorithm (only adjacent

clusters can be merged together), through a connectivity matrix that defines

for each sample the neighboring samples following a given structure of the

data. For instance, in the swiss-roll example below, the connectivity

constraints forbid the merging of points that are not adjacent on the swiss

roll, and thus avoid forming clusters that extend across overlapping folds of

the roll.

These constraints are useful to impose a certain local structure, but they

also make the algorithm faster, especially when the number of the samples

is high.

The connectivity constraints are imposed via a connectivity matrix: a

scipy sparse matrix that has elements only at the intersection of a row

and a column with indices of the dataset that should be connected. This

matrix can be constructed from a-priori information: for instance, you

may wish to cluster web pages by only merging pages with a link pointing

from one to another. It can also be learned from the data, for instance

using sklearn.neighbors.kneighbors_graph to restrict

merging to nearest neighbors as in this example, or

using sklearn.feature_extraction.image.grid_to_graph to

enable only merging of neighboring pixels on an image, as in the

coin example.
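
A minimal sketch of a connectivity matrix learned from the data with kneighbors_graph. The synthetic data, the 10 neighbors, and the 5 clusters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic data, for illustration only

# Sparse matrix with a nonzero entry (i, j) when sample j is one of the
# 10 nearest neighbors of sample i.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(
    n_clusters=5, connectivity=connectivity, linkage="ward"
).fit(X)
print(np.bincount(model.labels_))
```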

Examples:

A demo of structured Ward hierarchical clustering on an image of coins: Ward clustering

to split the image of coins in regions.

Hierarchical clustering: structured vs unstructured ward: Example of

Ward algorithm on a swiss-roll, comparison of structured approaches

versus unstructured approaches.

Feature agglomeration vs. univariate selection:

Example of dimensionality reduction with feature agglomeration based on

Ward hierarchical clustering.

Agglomerative clustering with and without structure

Warning

Connectivity constraints with single, average and complete linkage

Connectivity constraints and single, complete or average linkage can enhance

the ‘rich getting richer’ aspect of agglomerative clustering,

particularly so if they are built with

sklearn.neighbors.kneighbors_graph. In the limit of a small

number of clusters, they tend to give a few macroscopically occupied

clusters and almost empty ones. (see the discussion in

Agglomerative clustering with and without structure).

Single linkage is the most brittle linkage option with regard to this issue.

2.3.6.4. Varying the metric¶

Single, average and complete linkage can be used with a variety of distances (or

affinities), in particular Euclidean distance (l2), Manhattan distance

(or Cityblock, or l1), cosine distance, or any precomputed affinity

matrix.

l1 distance is often good for sparse features, or sparse noise: i.e.

many of the features are zero, as in text mining using occurrences of

rare words.

cosine distance is interesting because it is invariant to global

scalings of the signal.

The guideline for choosing a metric is to use one that maximizes the

distance between samples in different classes, and minimizes that within

each class.

Examples:

Agglomerative clustering with different metrics

2.3.6.5. Bisecting K-Means¶

The BisectingKMeans is an iterative variant of KMeans, using

divisive hierarchical clustering. Instead of creating all centroids at once, centroids

are picked progressively based on a previous clustering: a cluster is split into two

new clusters repeatedly until the target number of clusters is reached.

BisectingKMeans is more efficient than KMeans when the number of

clusters is large since it only works on a subset of the data at each bisection

while KMeans always works on the entire dataset.

Although BisectingKMeans can’t benefit from the advantages of the "k-means++"

initialization by design, it will still produce results comparable to

KMeans(init="k-means++") in terms of inertia at cheaper computational costs, and will

likely produce better results than KMeans with a random initialization.

This variant is more efficient than agglomerative clustering if the number of clusters is

small compared to the number of data points.

This variant also does not produce empty clusters.

There exist two strategies for selecting the cluster to split:

bisecting_strategy="largest_cluster" selects the cluster having the most points

bisecting_strategy="biggest_inertia" selects the cluster with biggest inertia

(cluster with biggest Sum of Squared Errors within)

Picking the cluster with the most data points produces, in most cases, results as

accurate as picking by inertia, and it is faster (especially for large numbers of data

points, where calculating the error may be costly).

Picking by largest amount of data points will also likely produce clusters of similar

sizes while KMeans is known to produce clusters of different sizes.

The difference between Bisecting K-Means and regular K-Means can be seen in the example

Bisecting K-Means and Regular K-Means Performance Comparison.

While the regular K-Means algorithm tends to create non-related clusters,

clusters from Bisecting K-Means are well ordered and create quite a visible hierarchy.

References:

“A Comparison of Document Clustering Techniques”

Michael Steinbach, George Karypis and Vipin Kumar,

Department of Computer Science and Engineering, University of Minnesota

(June 2000)

“Performance Analysis of K-Means and Bisecting K-Means Algorithms in Weblog Data”

K.Abirami and Dr.P.Mayilvahanan,

International Journal of Emerging Technologies in Engineering Research (IJETER)

Volume 4, Issue 8, (August 2016)

“Bisecting K-means Algorithm Based on K-valued Self-determining

and Clustering Center Optimization”

Jian Di, Xinyue Gou

School of Control and Computer Engineering,North China Electric Power University,

Baoding, Hebei, China (August 2017)

2.3.7. DBSCAN¶

The DBSCAN algorithm views clusters as areas of high density

separated by areas of low density. Due to this rather generic view, clusters

found by DBSCAN can be any shape, as opposed to k-means which assumes that

clusters are convex shaped. The central component of DBSCAN is the concept

of core samples, which are samples that are in areas of high density. A

cluster is therefore a set of core samples, each close to each other

(measured by some distance measure)

and a set of non-core samples that are close to a core sample (but are not

themselves core samples). There are two parameters to the algorithm,

min_samples and eps,

which define formally what we mean when we say dense.

Higher min_samples or lower eps

indicate higher density necessary to form a cluster.

More formally, we define a core sample as being a sample in the dataset such

that there exist min_samples other samples within a distance of

eps, which are defined as neighbors of the core sample. This tells

us that the core sample is in a dense area of the vector space. A cluster

is a set of core samples that can be built by recursively taking a core

sample, finding all of its neighbors that are core samples, finding all of

their neighbors that are core samples, and so on. A cluster also has a

set of non-core samples, which are samples that are neighbors of a core sample

in the cluster but are not themselves core samples. Intuitively, these samples

are on the fringes of a cluster.

Any core sample is part of a cluster, by definition. Any sample that is not a

core sample, and is at least eps in distance from any core sample, is

considered an outlier by the algorithm.

While the parameter min_samples primarily controls how tolerant the

algorithm is towards noise (on noisy and large data sets it may be desirable

to increase this parameter), the parameter eps is crucial to choose

appropriately for the data set and distance function and usually cannot be

left at the default value. It controls the local neighborhood of the points.

When chosen too small, most data will not be clustered at all (and labeled

as -1 for “noise”). When chosen too large, it causes close clusters to

be merged into one cluster, and eventually the entire data set to be returned

as a single cluster. Some heuristics for choosing this parameter have been

discussed in the literature, for example based on a knee in the nearest neighbor

distances plot (as discussed in the references below).
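
The effect of eps can be seen on a small hand-built dataset. The points and parameter values below are made up for illustration: two 3x3 grids with unit spacing placed far apart, plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two 3x3 grids of points (spacing 1) far apart, plus one isolated point.
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X = np.vstack([grid, grid + 20.0, [[100.0, 100.0]]])

summary = {}
for eps in (0.1, 1.5, 200.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With eps=0.1 every point is labeled as noise, with eps=1.5 the two grids form two clusters and the isolated point is noise, and with eps=200.0 the whole dataset collapses into a single cluster.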

In the figure below, the color indicates cluster membership, with large circles

indicating core samples found by the algorithm. Smaller circles are non-core

samples that are still part of a cluster. Moreover, the outliers are indicated

by black points below.

Examples:

Demo of DBSCAN clustering algorithm

Implementation

The DBSCAN algorithm is deterministic, always generating the same clusters

when given the same data in the same order. However, the results can differ when

data is provided in a different order. First, even though the core samples

will always be assigned to the same clusters, the labels of those clusters

will depend on the order in which those samples are encountered in the data.

Second and more importantly, the clusters to which non-core samples are assigned

can differ depending on the data order. This would happen when a non-core sample

has a distance lower than eps to two core samples in different clusters. By the

triangular inequality, those two core samples must be more distant than

eps from each other, or they would be in the same cluster. The non-core

sample is assigned to whichever cluster is generated first in a pass

through the data, and so the results will depend on the data ordering.

The current implementation uses ball trees and kd-trees

to determine the neighborhood of points,

which avoids calculating the full distance matrix

(as was done in scikit-learn versions before 0.14).

The possibility to use custom metrics is retained;

for details, see NearestNeighbors.

Memory consumption for large sample sizes

This implementation is by default not memory efficient because it constructs

a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot

be used (e.g., with sparse matrices). This matrix will consume $n^2$ floats.

A couple of mechanisms for getting around this are:

Use OPTICS clustering in conjunction with the

extract_dbscan method. OPTICS clustering also calculates the full

pairwise matrix, but only keeps one row in memory at a time (memory

complexity n).

A sparse radius neighborhood graph (where missing entries are presumed to

be out of eps) can be precomputed in a memory-efficient way and dbscan

can be run over this with metric='precomputed'. See

sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.

The dataset can be compressed, either by removing exact duplicates if

these occur in your data, or by using BIRCH. Then you only have a

relatively small number of representatives for a large number of points.

You can then provide a sample_weight when fitting DBSCAN.
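
The sparse radius neighborhood graph approach can be sketched as follows. The toy data and parameter values are assumptions made for illustration; on such a small dataset there is no memory benefit, the point is only to show the calls involved.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two 3x3 grids of points (spacing 1) far apart, plus one isolated point.
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X = np.vstack([grid, grid + 20.0, [[100.0, 100.0]]])
eps = 1.5

# Sparse graph that stores only the pairwise distances that are <= eps.
nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance")

labels_sparse = DBSCAN(eps=eps, min_samples=3, metric="precomputed").fit_predict(graph)
labels_direct = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
print(np.array_equal(labels_sparse, labels_direct))
```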

References:

“A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases

with Noise”

Ester, M., H. P. Kriegel, J. Sander, and X. Xu,

In Proceedings of the 2nd International Conference on Knowledge Discovery

and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

“DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.”

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).

In ACM Transactions on Database Systems (TODS), 42(3), 19.

2.3.8. HDBSCAN

The HDBSCAN algorithm can be seen as an extension of DBSCAN

and OPTICS. Specifically, DBSCAN assumes that the clustering

criterion (i.e. density requirement) is globally homogeneous.

In other words, DBSCAN may struggle to successfully capture clusters

with different densities.

HDBSCAN relaxes this assumption and explores all possible density

scales by building an alternative representation of the clustering problem.

Note

This implementation is adapted from the original implementation of HDBSCAN,

scikit-learn-contrib/hdbscan based on [LJ2017].

2.3.8.1. Mutual Reachability Graph

HDBSCAN first defines $d_c(x_p)$, the core distance of a sample $x_p$, as the

distance to its min_samples th-nearest neighbor, counting itself. For example,

if min_samples=5 and $x_*$ is the 5th-nearest neighbor of $x_p$

then the core distance is:

\[d_c(x_p)=d(x_p, x_*).\]

Next it defines $d_m(x_p, x_q)$, the mutual reachability distance of two points

$x_p, x_q$, as:

\[d_m(x_p, x_q) = \max\{d_c(x_p), d_c(x_q), d(x_p, x_q)\}\]
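The two definitions above can be checked on a small example. The sketch below assumes Euclidean distance and a dense distance matrix, which is only practical for small inputs; the function name is hypothetical and this is not how scikit-learn computes these quantities internally.

```python
import numpy as np

def mutual_reachability(X, min_samples):
    """Return (d_m, d_c) for the rows of X, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    # Dense pairwise distances d(x_p, x_q).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the min_samples-th nearest neighbor,
    # counting the sample itself (hence index min_samples - 1).
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    # d_m(x_p, x_q) = max(d_c(x_p), d_c(x_q), d(x_p, x_q)).
    d_m = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    return d_m, core

# Three close points and one isolated point on a line.
d_m, core = mutual_reachability([[0.0], [1.0], [2.0], [10.0]], min_samples=2)
```

Here the isolated point has core distance 8, so every mutual reachability distance involving it is at least 8, even towards the point at 2.0 that is exactly 8 away.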

These two notions allow us to construct the mutual reachability graph

$G_{ms}$, defined for a fixed choice of min_samples, by associating each

sample $x_p$ with a vertex of the graph and weighting the edge between any

two points $x_p, x_q$ by the mutual reachability distance $d_m(x_p, x_q)$

between them. We may build subsets of this graph, denoted as

$G_{ms,\varepsilon}$, by removing any edges with value greater than $\varepsilon$

from the original graph. Any points whose core distance is greater than $\varepsilon$

are at this stage marked as noise. The remaining points are then clustered by

finding the connected components of this trimmed graph.

Note

Taking the connected components of a trimmed graph $G_{ms,\varepsilon}$ is

equivalent to running DBSCAN* with min_samples and $\varepsilon$. DBSCAN* is a

slightly modified version of DBSCAN mentioned in [CM2013].
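As an illustration of the note above, the following sketch labels the connected components of the trimmed graph for one value of $\varepsilon$. It assumes Euclidean distance and a dense matrix, uses a hypothetical function name, and is meant to show the idea rather than mirror any library implementation.

```python
import numpy as np

def dbscan_star(X, min_samples, eps):
    """Connected components of the mutual reachability graph trimmed at eps."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    keep = mreach <= eps          # edges that survive the trimming
    labels = np.full(len(X), -1)  # -1 marks noise
    current = 0
    for start in range(len(X)):
        # Skip noise points (core distance above eps) and labeled points.
        if core[start] > eps or labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            p = stack.pop()
            for q in np.flatnonzero(keep[p]):
                if labels[q] == -1:
                    labels[q] = current
                    stack.append(q)
        current += 1
    return labels

labels = dbscan_star([[0.0], [1.0], [2.0], [10.0]], min_samples=2, eps=1.5)
```

A point whose core distance exceeds eps has no surviving edges, because every mutual reachability distance involving it is at least its core distance, so it can never be reached by the traversal and keeps the noise label.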

2.3.8.2. Hierarchical Clustering

HDBSCAN can be seen as an algorithm which performs DBSCAN* clustering across all

values of $\varepsilon$. As mentioned previously, this is equivalent to finding the connected

components of the mutual reachability graphs for all values of $\varepsilon$. To do this

efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the

fully-connected mutual reachability graph, then greedily cuts the edges with highest

weight. An outline of the HDBSCAN algorithm is as follows:

1. Extract the MST of $G_{ms}$.

2. Extend the MST by adding a “self edge” for each vertex, with weight equal

to the core distance of the underlying sample.

3. Initialize a single cluster and label for the MST.

4. Remove the edge with the greatest weight from the MST (ties are

removed simultaneously).

5. Assign cluster labels to the connected components which contain the

end points of the now-removed edge. If the component does not have at least

one edge it is instead assigned a “null” label marking it as noise.

6. Repeat steps 4-5 until there are no more connected components.

HDBSCAN is therefore able to obtain all possible partitions achievable by

DBSCAN* for a fixed choice of min_samples in a hierarchical fashion.

Indeed, this allows HDBSCAN to perform clustering across multiple densities

and as such it no longer needs $\varepsilon$ to be given as a hyperparameter. Instead

it relies solely on the choice of min_samples, which tends to be a more robust

hyperparameter.

HDBSCAN can be smoothed with an additional hyperparameter min_cluster_size

which specifies that during the hierarchical clustering, components with fewer

than min_cluster_size many samples are considered noise. In practice, one

can set min_cluster_size = min_samples to couple the parameters and

simplify the hyperparameter space.

References:

[CM2013]

Campello, R.J.G.B., Moulavi, D., Sander, J. (2013). Density-Based Clustering

Based on Hierarchical Density Estimates. In: Pei, J., Tseng, V.S., Cao, L.,

Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining.

PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin,

Heidelberg.


[LJ2017]

L. McInnes and J. Healy, (2017). Accelerated Hierarchical Density Based

Clustering. In: IEEE International Conference on Data Mining Workshops (ICDMW),

2017, pp. 33-42.


2.3.9. OPTICS

The OPTICS algorithm shares many similarities with the DBSCAN

algorithm, and can be considered a generalization of DBSCAN that relaxes the

eps requirement from a single value to a value range. The key difference

between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability

graph, which assigns each sample both a reachability_ distance, and a spot

within the cluster ordering_ attribute; these two attributes are assigned

when the model is fitted, and are used to determine cluster membership. If

OPTICS is run with the default value of inf set for max_eps, then DBSCAN

style cluster extraction can be performed repeatedly in linear time for any

given eps value using the cluster_optics_dbscan method. Setting

max_eps to a lower value will result in shorter run times, and can be

thought of as the maximum neighborhood radius from each point to find other

potential reachable points.
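The repeated DBSCAN-style extraction described above can be sketched as follows: OPTICS is fitted once and cluster_optics_dbscan is then called for several eps values on the stored attributes. The toy data and the two eps values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Toy data (an assumption): two tight blobs roughly 7 units apart.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

# Fit once; max_eps is left at its default of inf.
optics = OPTICS(min_samples=5).fit(X)

def n_clusters_at(eps):
    """Number of DBSCAN-style clusters extracted at a given eps."""
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    return len(set(labels) - {-1})

# A small eps separates the blobs, a large one merges them.
counts = {eps: n_clusters_at(eps) for eps in (0.5, 10.0)}
```

Each extraction only scans the stored reachability and core distances, so trying many eps values does not require refitting the model.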

The reachability distances generated by OPTICS allow for variable density

extraction of clusters within a single data set. As shown in the above plot,

combining reachability distances and data set ordering_ produces a

reachability plot, where point density is represented on the Y-axis, and

points are ordered such that nearby points are adjacent. ‘Cutting’ the

reachability plot at a single value produces DBSCAN like results; all points

above the ‘cut’ are classified as noise, and each time that there is a break

when reading from left to right signifies a new cluster. The default cluster

extraction with OPTICS looks at the steep slopes within the graph to find

clusters, and the user can define what counts as a steep slope using the

parameter xi. There are also other possibilities for analysis on the graph

itself, such as generating hierarchical representations of the data through

reachability-plot dendrograms, and the hierarchy of clusters detected by the

algorithm can be accessed through the cluster_hierarchy_ attribute. The

plot above has been color-coded so that cluster colors in planar space match

the linear segment clusters of the reachability plot. Note that the blue and

red clusters are adjacent in the reachability plot, and can be hierarchically

represented as children of a larger parent cluster.

Examples:

Demo of OPTICS clustering algorithm

Comparison with DBSCAN

The results from OPTICS cluster_optics_dbscan method and DBSCAN are

very similar, but not always identical; specifically, labeling of periphery

and noise points. This is in part because the first samples of each dense

area processed by OPTICS have a large reachability value while being close

to other points in their area, and will thus sometimes be marked as noise

rather than periphery. This affects adjacent points when they are

considered as candidates for being marked as either periphery or noise.

Note that for any single value of eps, DBSCAN will tend to have a

shorter run time than OPTICS; however, for repeated runs at varying eps

values, a single run of OPTICS may require less cumulative runtime than

DBSCAN. It is also important to note that OPTICS’ output is close to

DBSCAN’s only if eps and max_eps are close.

Computational Complexity

Spatial indexing trees are used to avoid calculating the full distance

matrix, and allow for efficient memory usage on large sets of samples.

Different distance metrics can be supplied via the metric keyword.

For large datasets, similar (but not identical) results can be obtained via

HDBSCAN. The HDBSCAN implementation is

multithreaded, and has better algorithmic runtime complexity than OPTICS,

at the cost of worse memory scaling. For extremely large datasets that

exhaust system memory using HDBSCAN, OPTICS will maintain $n$ (as opposed

to $n^2$) memory scaling; however, tuning of the max_eps parameter

will likely need to be used to give a solution in a reasonable amount of

wall time.

References:

“OPTICS: ordering points to identify the clustering structure.”

Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.

In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.

2.3.10. BIRCH

The Birch algorithm builds a tree called the Clustering Feature Tree (CFT)

for the given data. The data is essentially lossy compressed to a set of

Clustering Feature nodes (CF Nodes). The CF Nodes have a number of

subclusters called Clustering Feature subclusters (CF Subclusters)

and these CF Subclusters located in the non-terminal CF Nodes

can have CF Nodes as children.

The CF Subclusters hold the necessary information for clustering which prevents

the need to hold the entire input data in memory. This information includes:

Number of samples in a subcluster.

Linear Sum - An n-dimensional vector holding the sum of all samples

Squared Sum - Sum of the squared L2 norm of all samples.

Centroids - linear sum / n_samples, stored to avoid recalculation.

Squared norm of the centroids.
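To see why these statistics are enough, the sketch below keeps only the count, linear sum and squared sum, and derives the centroid and radius from them. The class name and methods are hypothetical and written in plain Python for illustration; they are not scikit-learn's internal data structure.

```python
import math
from dataclasses import dataclass, field

@dataclass
class CFStats:  # hypothetical name, for illustration only
    n_samples: int = 0
    linear_sum: list = field(default_factory=list)
    squared_sum: float = 0.0

    def add(self, sample):
        """Absorb one sample without storing it."""
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(sample)
        self.n_samples += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, sample)]
        self.squared_sum += sum(x * x for x in sample)

    def merge(self, other):
        """Merging two subclusters is just adding their statistics."""
        return CFStats(
            self.n_samples + other.n_samples,
            [s + t for s, t in zip(self.linear_sum, other.linear_sum)],
            self.squared_sum + other.squared_sum,
        )

    @property
    def centroid(self):
        return [s / self.n_samples for s in self.linear_sum]

    @property
    def radius(self):
        # Root mean squared distance to the centroid:
        # SS / n - ||centroid||^2, clipped at 0 against rounding error.
        sq_norm = sum(c * c for c in self.centroid)
        return math.sqrt(max(0.0, self.squared_sum / self.n_samples - sq_norm))

left, right = CFStats(), CFStats()
left.add([0.0, 0.0])
right.add([2.0, 0.0])
merged = left.merge(right)
```

Because merging is additive, the radius of a candidate merge can be evaluated without revisiting any of the original samples.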

The BIRCH algorithm has two parameters, the threshold and the branching factor.

The branching factor limits the number of subclusters in a node and the

threshold limits the distance between the entering sample and the existing

subclusters.

This algorithm can be viewed as an instance or data reduction method,

since it reduces the input data to a set of subclusters which are obtained directly

from the leaves of the CFT. This reduced data can be further processed by feeding

it into a global clusterer. This global clusterer can be set by n_clusters.

If n_clusters is set to None, the subclusters from the leaves are directly

read off, otherwise a global clustering step labels these subclusters into global

clusters (labels) and the samples are mapped to the global label of the nearest subcluster.

Algorithm description:

A new sample is inserted into the root of the CF Tree which is a CF Node.

It is then merged with the subcluster of the root, that has the smallest

radius after merging, constrained by the threshold and branching factor conditions.

If the subcluster has any child node, then this is done repeatedly till it reaches

a leaf. After finding the nearest subcluster in the leaf, the properties of this

subcluster and the parent subclusters are recursively updated.

If the radius of the subcluster obtained by merging the new sample and the

nearest subcluster is greater than the square of the threshold and if the

number of subclusters is greater than the branching factor, then a space is temporarily

allocated to this new sample. The two farthest subclusters are taken and

the subclusters are divided into two groups on the basis of the distance

between these subclusters.

If this split node has a parent subcluster and there is room

for a new subcluster, then the parent is split into two. If there is no room,

then this node is again split into two and the process is continued

recursively, till it reaches the root.

BIRCH or MiniBatchKMeans?

BIRCH does not scale very well to high dimensional data. As a rule of thumb if

n_features is greater than twenty, it is generally better to use MiniBatchKMeans.

If the number of instances of data needs to be reduced, or if one wants a

large number of subclusters either as a preprocessing step or otherwise,

BIRCH is more useful than MiniBatchKMeans.

How to use partial_fit?

To avoid the computation of global clustering, for every call of partial_fit

the user is advised

To set n_clusters=None initially

Train all data by multiple calls to partial_fit.

Set n_clusters to a required value using

brc.set_params(n_clusters=n_clusters).

Call partial_fit finally with no arguments, i.e. brc.partial_fit()

which performs the global clustering.
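The four steps above can be sketched as follows. The toy data, the number of chunks and the final n_clusters=3 are assumptions for illustration; brc is the estimator name used in the text.

```python
import numpy as np
from sklearn.cluster import Birch

# Toy data (an assumption): three well-separated blobs, shuffled.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2)) for c in (0.0, 5.0, 10.0)])
rng.shuffle(X)

# Step 1: no global clustering while the tree is being built.
brc = Birch(n_clusters=None)

# Step 2: feed the data in chunks.
for chunk in np.array_split(X, 5):
    brc.partial_fit(chunk)

# Step 3: choose the number of global clusters.
brc.set_params(n_clusters=3)

# Step 4: calling partial_fit without data runs only the global clustering.
brc.partial_fit()

labels = brc.predict(X)
```

The global step runs once on the subcluster centroids rather than after every chunk, which is the point of deferring n_clusters.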

References:

Tian Zhang, Raghu Ramakrishnan, Miron Livny

BIRCH: An efficient data clustering method for large databases.

https://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf

Roberto Perdisci

JBirch - Java implementation of BIRCH clustering algorithm

https://code.google.com/archive/p/jbirch

2.3.11. Clustering performance evaluation

Evaluating the performance of a clustering algorithm is not as trivial as

counting the number of errors or the precision and recall of a supervised

classification algorithm. In particular any evaluation metric should not

take the absolute values of the cluster labels into account but rather

whether this clustering defines separations of the data similar to some ground

truth set of classes, or satisfies some assumption such that members

belonging to the same class are more similar than members of different

classes according to some similarity metric.

2.3.11.1. Rand index

Given the knowledge of the ground truth class assignments

labels_true and our clustering algorithm assignments of the same

samples labels_pred, the (adjusted or unadjusted) Rand index

is a function that measures the similarity of the two assignments,

ignoring permutations:

>>> from sklearn import metrics

>>> labels_true = [0, 0, 0, 1, 1, 1]

>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.rand_score(labels_true, labels_pred)

0.66...

The Rand index does not guarantee a value close to 0.0 for a

random labelling. The adjusted Rand index corrects for chance and

will give such a baseline.

>>> metrics.adjusted_rand_score(labels_true, labels_pred)

0.24...

As with all clustering metrics, one can permute 0 and 1 in the predicted

labels, rename 2 to 3, and get the same score:

>>> labels_pred = [1, 1, 0, 0, 3, 3]

>>> metrics.rand_score(labels_true, labels_pred)

0.66...

>>> metrics.adjusted_rand_score(labels_true, labels_pred)

0.24...

Furthermore, both rand_score and adjusted_rand_score are

symmetric: swapping the arguments does not change the scores. They can

thus be used as consensus measures:

>>> metrics.rand_score(labels_pred, labels_true)

0.66...

>>> metrics.adjusted_rand_score(labels_pred, labels_true)

0.24...

Perfect labeling is scored 1.0:

>>> labels_pred = labels_true[:]

>>> metrics.rand_score(labels_true, labels_pred)

1.0

>>> metrics.adjusted_rand_score(labels_true, labels_pred)

1.0

Poorly agreeing labels (e.g. independent labelings) have lower scores,

and for the adjusted Rand index the score will be negative or close to

zero. However, for the unadjusted Rand index the score, while lower,

will not necessarily be close to zero:

>>> labels_true = [0, 0, 0, 0, 0, 0, 1, 1]

>>> labels_pred = [0, 1, 2, 3, 4, 5, 5, 6]

>>> metrics.rand_score(labels_true, labels_pred)

0.39...

>>> metrics.adjusted_rand_score(labels_true, labels_pred)

-0.07...

2.3.11.1.1. Advantages

Interpretability: The unadjusted Rand index is proportional

to the number of sample pairs whose labels are the same in both

labels_pred and labels_true, or are different in both.

Random (uniform) label assignments have an adjusted Rand index

score close to 0.0 for any value of n_clusters and

n_samples (which is not the case for the unadjusted Rand index

or the V-measure for instance).

Bounded range: Lower values indicate different labelings,

similar clusterings have a high (adjusted or unadjusted) Rand index,

1.0 is the perfect match score. The score range is [0, 1] for the

unadjusted Rand index and [-1, 1] for the adjusted Rand index.

No assumption is made on the cluster structure: The (adjusted or

unadjusted) Rand index can be used to compare all kinds of

clustering algorithms, and can be used to compare clustering

algorithms such as k-means which assumes isotropic blob shapes with

results of spectral clustering algorithms which can find clusters

with “folded” shapes.

2.3.11.1.2. Drawbacks

Contrary to inertia, the (adjusted or unadjusted) Rand index

requires knowledge of the ground truth classes which is almost

never available in practice or requires manual assignment by human

annotators (as in the supervised learning setting).

However (adjusted or unadjusted) Rand index can also be useful in a

purely unsupervised setting as a building block for a Consensus

Index that can be used for clustering model selection.

The unadjusted Rand index is often close to 1.0 even if the

clusterings themselves differ significantly. This can be understood

when interpreting the Rand index as the accuracy of element pair

labeling resulting from the clusterings: In practice there often is

a majority of element pairs that are assigned the “different” pair

label under both the predicted and the ground truth clustering

resulting in a high proportion of pair labels that agree, which

leads subsequently to a high score.

Examples:

Adjustment for chance in clustering performance evaluation:

Analysis of the impact of the dataset size on the value of

clustering measures for random assignments.

2.3.11.1.3. Mathematical formulation

If C is a ground truth class assignment and K the clustering, let us

define $a$ and $b$ as:

$a$, the number of pairs of elements that are in the same set

in C and in the same set in K

$b$, the number of pairs of elements that are in different sets

in C and in different sets in K

The unadjusted Rand index is then given by:

\[\text{RI} = \frac{a + b}{C_2^{n_{samples}}}\]

where $C_2^{n_{samples}}$ is the total number of possible pairs

in the dataset. It does not matter if the calculation is performed on

ordered pairs or unordered pairs as long as the calculation is

performed consistently.
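The definition of $a$, $b$ and RI can be checked directly by enumerating unordered pairs. The function below is a plain-Python sketch with a hypothetical name; it is quadratic in the number of samples and meant only to mirror the formula.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Unadjusted Rand index by explicit enumeration of unordered pairs."""
    pairs = list(combinations(range(len(labels_true)), 2))
    a = b = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1  # same set in C and same set in K
        elif not same_true and not same_pred:
            b += 1  # different sets in C and different sets in K
    return (a + b) / len(pairs)

ri = rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

For these labels there are 15 pairs, with $a = 2$ and $b = 8$, so RI = 10/15, which is the 0.66... value shown earlier in this section.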

However, the Rand index does not guarantee that random label assignments

will get a value close to zero (esp. if the number of clusters is in

the same order of magnitude as the number of samples).

To counter this effect we can discount the expected RI $E[\text{RI}]$ of

random labelings by defining the adjusted Rand index as follows:

\[\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}\]
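One common way to evaluate this expression is the pair-counting form of Hubert and Arabie, where the expectation is taken under random labelings with fixed cluster sizes. The sketch below assumes that model and works with pair counts instead of the normalised RI (the common factor cancels); the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table (fixed-marginals model assumed)."""
    n = len(labels_true)
    # Pairs placed together in both labelings (the quantity a above).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each labeling separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    # Note: undefined (division by zero) for degenerate inputs such as two
    # labelings that each put every sample in a single cluster.
    return (together_both - expected) / (maximum - expected)

ari = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

For these labels the counts are 2 pairs together in both, an expected 1.2 and a maximum of 4.5, giving 0.8 / 3.3 ≈ 0.24.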

References

Comparing Partitions

L. Hubert and P. Arabie, Journal of Classification 1985

Properties of the Hubert-Arabie adjusted Rand index

D. Steinley, Psychological Methods 2004

Wikipedia entry for the Rand index

Wikipedia entry for the adjusted Rand index

2.3.11.2. Mutual Information based scores

Given the knowledge of the ground truth class assignments labels_true and

our clustering algorithm assignments of the same samples labels_pred, the

Mutual Information is a function that measures the agreement of the two

assignments, ignoring permutations. Two different normalized versions of this

measure are available, Normalized Mutual Information (NMI) and Adjusted

Mutual Information (AMI). NMI is often used in the literature, while AMI was

proposed more recently and is normalized against chance:

>>> from sklearn import metrics

>>> labels_true = [0, 0, 0, 1, 1, 1]

>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)

0.29879...

One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get

the same score:

>>> labels_pred = [1, 1, 0, 0, 3, 3]

>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)

0.29879...

All of mutual_info_score, adjusted_mutual_info_score and

normalized_mutual_info_score are symmetric: swapping the arguments does

not change the score. Thus they can be used as a consensus measure:

>>> metrics.adjusted_mutual_info_score(labels_pred, labels_true)

0.29879...

Perfect labeling is scored 1.0:

>>> labels_pred = labels_true[:]

>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)

1.0

>>> metrics.normalized_mutual_info_score(labels_true, labels_pred)

1.0

This is not true for mutual_info_score, which is therefore harder to judge:

>>> metrics.mutual_info_score(labels_true, labels_pred)

0.69...

Bad labelings (e.g. independent labelings) have non-positive scores:

>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]

>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]

>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)

-0.16666...

2.3.11.2.1. Advantages

Random (uniform) label assignments have an AMI score close to 0.0

for any value of n_clusters and n_samples (which is not the

case for raw Mutual Information or the V-measure for instance).

Upper bound of 1: Values close to zero indicate two label

assignments that are largely independent, while values close to one

indicate significant agreement. Further, an AMI of exactly 1 indicates

that the two label assignments are equal (with or without permutation).

2.3.11.2.2. Drawbacks

Contrary to inertia, MI-based measures require the knowledge

of the ground truth classes, which is almost never available in practice or

requires manual assignment by human annotators (as in the supervised learning

setting).

However MI-based measures can also be useful in a purely unsupervised setting as a

building block for a Consensus Index that can be used for clustering

model selection.

NMI and MI are not adjusted against chance.

Examples:

Adjustment for chance in clustering performance evaluation: Analysis of

the impact of the dataset size on the value of clustering measures

for random assignments. This example also includes the Adjusted Rand

Index.

2.3.11.2.3. Mathematical formulation

Assume two label assignments (of the same N objects), $U$ and $V$.

Their entropy is the amount of uncertainty for a partition set, defined by:

\[H(U) = - \sum_{i=1}^{|U|}P(i)\log(P(i))\]

where $P(i) = |U_i| / N$ is the probability that an object picked at

random from $U$ falls into class $U_i$. Likewise for $V$:

\[H(V) = - \sum_{j=1}^{|V|}P'(j)\log(P'(j))\]

With $P'(j) = |V_j| / N$. The mutual information (MI) between $U$

and $V$ is calculated by:

\[\text{MI}(U, V) = \sum_{i=1}^{|U|}\sum_{j=1}^{|V|}P(i, j)\log\left(\frac{P(i,j)}{P(i)P'(j)}\right)\]

where $P(i, j) = |U_i \cap V_j| / N$ is the probability that an object

picked at random falls into both classes $U_i$ and $V_j$.

It also can be expressed in set cardinality formulation:

\[\text{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N}\log\left(\frac{N|U_i \cap V_j|}{|U_i||V_j|}\right)\]

The normalized mutual information is defined as

\[\text{NMI}(U, V) = \frac{\text{MI}(U, V)}{\text{mean}(H(U), H(V))}\]

Neither the mutual information nor its normalized variant is

adjusted for chance: both tend to increase as the number of different labels

(clusters) increases, regardless of the actual amount of “mutual information”

between the label assignments.
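The entropy, MI and NMI formulas above translate almost line by line into code. The sketch below uses natural logarithms and the arithmetic mean of the two entropies as the normaliser; the function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """H(U) = -sum_i P(i) log P(i), with P(i) = |U_i| / N."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_info(labels_u, labels_v):
    """MI(U, V) in the set cardinality formulation."""
    n = len(labels_u)
    size_u = Counter(labels_u)
    size_v = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))  # only non-empty intersections
    return sum(
        (n_ij / n) * log(n * n_ij / (size_u[u] * size_v[v]))
        for (u, v), n_ij in joint.items()
    )

def normalized_mutual_info(labels_u, labels_v):
    """NMI with the arithmetic mean of the entropies as normaliser."""
    # Note: undefined when both labelings consist of a single cluster.
    mean_entropy = (entropy(labels_u) + entropy(labels_v)) / 2
    return mutual_info(labels_u, labels_v) / mean_entropy

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
mi = mutual_info(labels_true, labels_pred)
nmi = normalized_mutual_info(labels_true, labels_pred)
```

For these labels MI is $(2/3)\log 2 \approx 0.46$ and NMI is about 0.52; for two identical labelings MI equals the entropy ($\log 2 \approx 0.69$ here) rather than 1, which is why the raw score is harder to judge.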

The expected value for the mutual information can be calculated using the

following equation [VEB2009]. In this equation,

$a_i = |U_i|$ (the number of elements in $U_i$) and

$b_j = |V_j|$ (the number of elements in $V_j$).

\[E[\text{MI}(U,V)]=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \sum_{n_{ij}=(a_i+b_j-N)^+}^{\min(a_i, b_j)} \frac{n_{ij}}{N}\log \left( \frac{N \cdot n_{ij}}{a_i b_j}\right) \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!(N-a_i-b_j+n_{ij})!}\]

Using the expected value, the adjusted mutual information can then be

calculated using a similar form to that of the adjusted Rand index:

\[\text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\text{mean}(H(U), H(V)) - E[\text{MI}]}\]
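The expected mutual information and the AMI can be sketched in plain Python as follows. The factorial ratio in the equation for $E[\text{MI}]$ is the hypergeometric probability of observing $n_{ij}$ shared elements, so it is written here with binomial coefficients. The function names are hypothetical, natural logarithms are used, and the normaliser is the arithmetic mean as in the formula above.

```python
from collections import Counter
from math import comb, log

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _mutual_info(labels_u, labels_v):
    n = len(labels_u)
    size_u, size_v = Counter(labels_u), Counter(labels_v)
    return sum(
        (n_ij / n) * log(n * n_ij / (size_u[u] * size_v[v]))
        for (u, v), n_ij in Counter(zip(labels_u, labels_v)).items()
    )

def expected_mutual_info(labels_u, labels_v):
    """E[MI] under random labelings with fixed cluster sizes a_i, b_j."""
    n = len(labels_u)
    emi = 0.0
    for a in Counter(labels_u).values():
        for b in Counter(labels_v).values():
            # The n_ij = 0 term contributes nothing, so start at 1.
            for n_ij in range(max(1, a + b - n), min(a, b) + 1):
                prob = comb(a, n_ij) * comb(n - a, b - n_ij) / comb(n, b)
                emi += (n_ij / n) * log(n * n_ij / (a * b)) * prob
    return emi

def adjusted_mutual_info(labels_u, labels_v):
    mi = _mutual_info(labels_u, labels_v)
    emi = expected_mutual_info(labels_u, labels_v)
    mean_entropy = (_entropy(labels_u) + _entropy(labels_v)) / 2
    return (mi - emi) / (mean_entropy - emi)

ami = adjusted_mutual_info([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

With the arithmetic mean this gives about 0.299 for these labels. The value depends on the normaliser: taking the maximum of the two entropies instead yields about 0.225 for the same input.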

For normalized mutual information and adjusted mutual information, the normalizing

value is typically some generalized mean of the entropies of each clustering.

Various generalized means exist, and no firm rules exist for preferring one over the

others. The decision is largely made on a field-by-field basis; for instance, in community

detection, the arithmetic mean is most common. Each

normalizing method provides “qualitatively similar behaviours” [YAT2016]. In our

implementation, this is controlled by the average_method parameter.

Vinh et al. (2010) named variants of NMI and AMI by their averaging method [VEB2010]. Their

‘sqrt’ and ‘sum’ averages are the geometric and arithmetic means; we use these

more broadly common names.

References

Strehl, Alexander, and Joydeep Ghosh (2002). “Cluster ensembles – a

knowledge reuse framework for combining multiple partitions”. Journal of

Machine Learning Research 3: 583–617.

doi:10.1162/153244303321897735.

Wikipedia entry for the (normalized) Mutual Information

Wikipedia entry for the Adjusted Mutual Information

[VEB2009]

Vinh, Epps, and Bailey, (2009). “Information theoretic measures

for clusterings comparison”. Proceedings of the 26th Annual International

Conference on Machine Learning - ICML ‘09.

doi:10.1145/1553374.1553511.

ISBN 9781605585161.

[VEB2010]

Vinh, Epps, and Bailey, (2010). “Information Theoretic Measures for

Clusterings Comparison: Variants, Properties, Normalization and

Correction for Chance”. JMLR

[YAT2016]

Yang, Algesheimer, and Tessone, (2016). “A comparative analysis of

community detection algorithms on artificial networks”. Scientific Reports 6: 30750.

doi:10.1038/srep30750.

2.3.11.3. Homogeneity, completeness and V-measure

Given the knowledge of the ground truth class assignments of the samples,

it is possible to define some intuitive metric using conditional entropy

analysis.

In particular Rosenberg and Hirschberg (2007) define the following two

desirable objectives for any cluster assignment:

homogeneity: each cluster contains only members of a single class.

completeness: all members of a given class are assigned to the same

cluster.

We can turn those concepts into scores homogeneity_score and

completeness_score. Both are bounded below by 0.0 and above by

1.0 (higher is better):

>>> from sklearn import metrics

>>> labels_true = [0, 0, 0, 1, 1, 1]

>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.homogeneity_score(labels_true, labels_pred)

0.66...

>>> metrics.completeness_score(labels_true, labels_pred)

0.42...

Their harmonic mean, called V-measure, is computed by

v_measure_score:

>>> metrics.v_measure_score(labels_true, labels_pred)

0.51...

This function’s formula is as follows:

\[v = \frac{(1 + \beta) \times \text{homogeneity} \times \text{completeness}}{(\beta \times \text{homogeneity} + \text{completeness})}\]

beta defaults to a value of 1.0, but when using a value less than 1 for beta:

>>> metrics.v_measure_score(labels_true, labels_pred, beta=0.6)

0.54...

more weight will be attributed to homogeneity, and using a value greater than 1:

>>> metrics.v_measure_score(labels_true, labels_pred, beta=1.8)

0.48...

more weight will be attributed to completeness.

The V-measure is actually equivalent to the normalized mutual information (NMI)

discussed above, with the aggregation function being the arithmetic mean [B2011].

Homogeneity, completeness and V-measure can be computed at once using

homogeneity_completeness_v_measure as follows:

>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)

(0.66..., 0.42..., 0.51...)

The following clustering assignment is slightly better, since it is

homogeneous but not complete:

>>> labels_pred = [0, 0, 0, 1, 2, 2]

>>> metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)

(1.0, 0.68..., 0.81...)

Note

v_measure_score is symmetric: it can be used to evaluate

the agreement of two independent assignments on the same dataset.

This is not the case for completeness_score and

homogeneity_score: both are bound by the relationship:

homogeneity_score(a, b) == completeness_score(b, a)

2.3.11.3.1. Advantages

Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.

Intuitive interpretation: clustering with bad V-measure can be

qualitatively analyzed in terms of homogeneity and completeness

to get a better sense of what ‘kind’ of mistakes the assignment makes.

No assumption is made on the cluster structure: can be used

to compare clustering algorithms such as k-means which assumes isotropic

blob shapes with results of spectral clustering algorithms which can

find clusters with “folded” shapes.

2.3.11.3.2. Drawbacks

The previously introduced metrics are not normalized with regards to

random labeling: this means that depending on the number of samples,

clusters and ground truth classes, a completely random labeling will

not always yield the same values for homogeneity, completeness and

hence V-measure. In particular random labeling won’t yield zero

scores especially when the number of clusters is large.

This problem can safely be ignored when the number of samples is more

than a thousand and the number of clusters is less than 10. For

smaller sample sizes or larger number of clusters it is safer to use

an adjusted index such as the Adjusted Rand Index (ARI).

These metrics require the knowledge of the ground truth classes, which is

almost never available in practice or requires manual assignment by

human annotators (as in the supervised learning setting).

Examples:

Adjustment for chance in clustering performance evaluation: Analysis of

the impact of the dataset size on the value of clustering measures

for random assignments.

2.3.11.3.3. Mathematical formulation

Homogeneity and completeness scores are formally given by:

\[h = 1 - \frac{H(C|K)}{H(C)}\]

\[c = 1 - \frac{H(K|C)}{H(K)}\]

where $H(C|K)$ is the conditional entropy of the classes given

the cluster assignments and is given by:

\[H(C|K) = - \sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n}

\cdot \log\left(\frac{n_{c,k}}{n_k}\right)\]

and $H(C)$ is the entropy of the classes and is given by:

\[H(C) = - \sum_{c=1}^{|C|} \frac{n_c}{n} \cdot \log\left(\frac{n_c}{n}\right)\]

with $n$ the total number of samples, $n_c$ and $n_k$

the number of samples respectively belonging to class $c$ and

cluster $k$, and finally $n_{c,k}$ the number of samples

from class $c$ assigned to cluster $k$.

The conditional entropy of clusters given class $H(K|C)$ and the

entropy of clusters $H(K)$ are defined in a symmetric manner.

Rosenberg and Hirschberg further define V-measure as the harmonic

mean of homogeneity and completeness:

\[v = 2 \cdot \frac{h \cdot c}{h + c}\]
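The formulas above can be checked with a short plain-Python sketch. The function names are hypothetical; natural logarithms are used, and a score is set to 1.0 when the corresponding entropy is zero, which is an assumed convention for the degenerate single-class or single-cluster case. The optional beta argument follows the weighted formula given earlier in this section and reduces to the harmonic mean for beta = 1.

```python
from collections import Counter
from math import log

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels_a, labels_b):
    """H(A|B) = -sum_{a,b} n_ab / n * log(n_ab / n_b)."""
    n = len(labels_a)
    size_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return -sum((n_ab / n) * log(n_ab / size_b[b]) for (_, b), n_ab in joint.items())

def homogeneity_completeness_v(labels_true, labels_pred, beta=1.0):
    h_c = _entropy(labels_true)  # H(C)
    h_k = _entropy(labels_pred)  # H(K)
    # Assumed convention: a zero entropy gives a perfect score of 1.0.
    h = 1.0 if h_c == 0 else 1 - _conditional_entropy(labels_true, labels_pred) / h_c
    c = 1.0 if h_k == 0 else 1 - _conditional_entropy(labels_pred, labels_true) / h_k
    denominator = beta * h + c
    v = 0.0 if denominator == 0 else (1 + beta) * h * c / denominator
    return h, c, v

h, c, v = homogeneity_completeness_v([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

For these labels the result is approximately (0.667, 0.421, 0.516). Swapping the two arguments swaps homogeneity and completeness and leaves the V-measure unchanged when beta is 1.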

References

V-Measure: A conditional entropy-based external cluster evaluation

measure

Andrew Rosenberg and Julia Hirschberg, 2007

[B2011]

Identification and Characterization of Events in Social Media, Hila

Becker, PhD Thesis.

2.3.11.4. Fowlkes-Mallows scores

The Fowlkes-Mallows index (sklearn.metrics.fowlkes_mallows_score) can be

used when the ground truth class assignments of the samples are known. The

Fowlkes-Mallows score FMI is defined as the geometric mean of the

pairwise precision and recall:

\[\text{FMI} = \frac{\text{TP}}{\sqrt{(\text{TP} + \text{FP}) (\text{TP} + \text{FN})}}\]

Where TP is the number of True Positives (i.e. the number of pairs

of points that belong to the same cluster in both the true labels and the

predicted labels), FP is the number of False Positives (i.e. the number

of pairs of points that belong to the same cluster in the predicted labels but not

in the true labels) and FN is the number of False Negatives (i.e. the

number of pairs of points that belong to the same cluster in the true

labels but not in the predicted labels).
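The pair counts in this definition can be computed by direct enumeration. The sketch below is plain Python with a hypothetical function name; since the formula is symmetric in FP and FN, the score does not depend on which of the two mixed cases is called FP.

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """FMI by explicit enumeration of unordered pairs of points."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together in the prediction only
        elif same_true:
            fn += 1  # together in the ground truth only
    if tp == 0:
        return 0.0  # also avoids a zero denominator
    return tp / sqrt((tp + fp) * (tp + fn))

fmi = fowlkes_mallows([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

For these labels TP = 2, TP + FP = 3 and TP + FN = 6, so the score is $2 / \sqrt{18} \approx 0.471$.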

The score ranges from 0 to 1. A high value indicates a good similarity

between two clusterings.

>>> from sklearn import metrics

>>> labels_true = [0, 0, 0, 1, 1, 1]

>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)

0.47140...

One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get

the same score:

>>> labels_pred = [1, 1, 0, 0, 3, 3]

>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)

0.47140...

Perfect labeling is scored 1.0:

>>> labels_pred = labels_true[:]

>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)

1.0

Bad labelings (e.g. independent labelings) have zero scores:

>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]

>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]

>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)

0.0

2.3.11.4.1. Advantages¶

Random (uniform) label assignments have an FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure, for instance).

Upper-bounded at 1: values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, a value of exactly 0 indicates purely independent label assignments and an FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).

No assumption is made on the cluster structure: the score can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with “folded” shapes.

2.3.11.4.2. Drawbacks¶

Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).

References

E. B. Fowlkes and C. L. Mallows, 1983. “A method for comparing two

hierarchical clusterings”. Journal of the American Statistical Association.

https://www.tandfonline.com/doi/abs/10.1080/01621459.1983.10478008

Wikipedia entry for the Fowlkes-Mallows Index

2.3.11.5. Silhouette Coefficient¶

If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same cluster.

b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

\[s = \frac{b - a}{\max(a, b)}\]

The Silhouette Coefficient for a set of samples is given as the mean of the

Silhouette Coefficient for each sample.
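The per-sample definition can be sketched with NumPy on a tiny dataset. This is an illustration under stated assumptions, not the library's implementation: the function name `silhouette_samples_sketch` is hypothetical, Euclidean distance is used, at least two clusters are assumed, and a sample that is alone in its cluster is given a score of 0 by convention.

```python
import numpy as np


def silhouette_samples_sketch(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full matrix of pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself from its own cluster
        if not same.any():
            continue  # singleton cluster: score left at 0 (convention assumed here)
        a = dist[i, same].mean()
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
s = silhouette_samples_sketch(X_toy, labels_toy)
print(s.round(4))           # per-sample scores
print(round(s.mean(), 4))   # 0.8997
```

For the first sample, a = 1 and b = 10.5, giving s = 9.5 / 10.5; the overall score is the mean over the four samples.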

>>> from sklearn import metrics

>>> from sklearn.metrics import pairwise_distances

>>> from sklearn import datasets

>>> X, y = datasets.load_iris(return_X_y=True)

In normal usage, the Silhouette Coefficient is applied to the results of a

cluster analysis.

>>> import numpy as np

>>> from sklearn.cluster import KMeans

>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)

>>> labels = kmeans_model.labels_

>>> metrics.silhouette_score(X, labels, metric='euclidean')

0.55...

References

Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Journal of Computational and Applied Mathematics 20: 53–65.

2.3.11.5.1. Advantages¶

The score is bounded between -1 for incorrect clustering and +1 for highly

dense clustering. Scores around zero indicate overlapping clusters.

The score is higher when clusters are dense and well separated, which relates

to a standard concept of a cluster.

2.3.11.5.2. Drawbacks¶

The Silhouette Coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.

Examples:

Selecting the number of clusters with silhouette analysis on KMeans clustering: in this example the silhouette analysis is used to choose an optimal value for n_clusters.
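A compact version of that idea is to fit KMeans for several candidate values of n_clusters and compare the mean silhouette scores. The sketch below is only an illustration: the candidate range 2 to 6, `n_init=10` and `random_state=0` are arbitrary choices made here, and the mean score alone is a coarser criterion than a full per-sample silhouette analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

scores = {}
for k in range(2, 7):  # candidate values, chosen arbitrarily for this sketch
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

# Candidate with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"n_clusters={k}: mean silhouette = {score:.3f}")
print("highest mean silhouette at n_clusters =", best_k)
```

The highest mean score is a starting point rather than a final answer; domain knowledge about the expected number of groups may justify a different choice.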

2.3.11.6. Calinski-Harabasz Index¶

If the ground truth labels are not known, the Calinski-Harabasz index

(sklearn.metrics.calinski_harabasz_score) - also known as the Variance

Ratio Criterion - can be used to evaluate the model, where a higher

Calinski-Harabasz score relates to a model with better defined clusters.

The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared).

>>> from sklearn import metrics

>>> from sklearn.metrics import pairwise_distances

>>> from sklearn import datasets

>>> X, y = datasets.load_iris(return_X_y=True)

In normal usage, the Calinski-Harabasz index is applied to the results of a

cluster analysis:

>>> import numpy as np

>>> from sklearn.cluster import KMeans

>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)

>>> labels = kmeans_model.labels_

>>> metrics.calinski_harabasz_score(X, labels)

561.59...

2.3.11.6.1. Advantages¶

The score is higher when clusters are dense and well separated, which relates

to a standard concept of a cluster.

The score is fast to compute.

2.3.11.6.2. Drawbacks¶

The Calinski-Harabasz index is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.

2.3.11.6.3. Mathematical formulation¶

For a set of data $E$ of size $n_E$ which has been clustered into

$k$ clusters, the Calinski-Harabasz score $s$ is defined as the

ratio of the between-clusters dispersion mean and the within-cluster dispersion:

\[s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}\]

where $\mathrm{tr}(B_k)$ is the trace of the between-group dispersion matrix and $\mathrm{tr}(W_k)$ is the trace of the within-cluster dispersion matrix, defined by:

\[W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T\]

\[B_k = \sum_{q=1}^k n_q (c_q - c_E) (c_q - c_E)^T\]

with $C_q$ the set of points in cluster $q$, $c_q$ the center

of cluster $q$, $c_E$ the center of $E$, and $n_q$ the

number of points in cluster $q$.
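The formulation above can be followed step by step with NumPy. This is a minimal sketch, assuming at least two clusters and a non-zero within-cluster dispersion; the function name `calinski_harabasz_sketch` and the toy data are illustrative. It relies on the fact that the trace of an outer product $(x - c)(x - c)^T$ equals the squared Euclidean norm of $x - c$, so the matrices never need to be built explicitly.

```python
import numpy as np


def calinski_harabasz_sketch(X, labels):
    """tr(B_k) / tr(W_k) * (n_E - k) / (k - 1), using squared distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    center_all = X.mean(axis=0)  # c_E, the center of the whole dataset
    trace_b = 0.0
    trace_w = 0.0
    for q in cluster_ids:
        members = X[labels == q]          # C_q
        center_q = members.mean(axis=0)   # c_q
        # n_q * ||c_q - c_E||^2 contributes to tr(B_k)
        trace_b += len(members) * np.sum((center_q - center_all) ** 2)
        # sum over x in C_q of ||x - c_q||^2 contributes to tr(W_k)
        trace_w += np.sum((members - center_q) ** 2)
    return (trace_b / trace_w) * (n_samples - k) / (k - 1)


X_toy = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels_toy = [0, 0, 1, 1]
ch = calinski_harabasz_sketch(X_toy, labels_toy)
print(ch)  # 50.0
```

In this toy case tr(B_k) = 100 and tr(W_k) = 4, and the factor (n_E - k) / (k - 1) equals 2, which gives 50.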

References

Caliński, T., & Harabasz, J. (1974).

“A Dendrite Method for Cluster Analysis”.

Communications in Statistics-theory and Methods 3: 1-27.

2.3.11.7. Davies-Bouldin Index¶

If the ground truth labels are not known, the Davies-Bouldin index

(sklearn.metrics.davies_bouldin_score) can be used to evaluate the

model, where a lower Davies-Bouldin index relates to a model with better

separation between the clusters.

This index signifies the average ‘similarity’ between clusters, where the

similarity is a measure that compares the distance between clusters with the

size of the clusters themselves.

Zero is the lowest possible score. Values closer to zero indicate a better

partition.

In normal usage, the Davies-Bouldin index is applied to the results of a

cluster analysis as follows:

>>> from sklearn import datasets

>>> iris = datasets.load_iris()

>>> X = iris.data

>>> from sklearn.cluster import KMeans

>>> from sklearn.metrics import davies_bouldin_score

>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)

>>> labels = kmeans.labels_

>>> davies_bouldin_score(X, labels)

0.666...

2.3.11.7.1. Advantages¶

The computation of Davies-Bouldin is simpler than that of Silhouette scores.

The index is solely based on quantities and features inherent to the dataset

as its computation only uses point-wise distances.

2.3.11.7.2. Drawbacks¶

The Davies-Bouldin index is generally more favorable (i.e. lower) for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained from DBSCAN.

The usage of centroid distance limits the distance metric to Euclidean space.

2.3.11.7.3. Mathematical formulation¶

The index is defined as the average similarity between each cluster $C_i$

for $i=1, ..., k$ and its most similar one $C_j$. In the context of

this index, similarity is defined as a measure $R_{ij}$ that trades off:

$s_i$, the average distance between each point of cluster $i$ and the centroid of that cluster, also known as the cluster diameter.

$d_{ij}$, the distance between cluster centroids $i$ and $j$.

A simple choice to construct $R_{ij}$ so that it is nonnegative and

symmetric is:

\[R_{ij} = \frac{s_i + s_j}{d_{ij}}\]

Then the Davies-Bouldin index is defined as:

\[DB = \frac{1}{k} \sum_{i=1}^k \max_{j \neq i} R_{ij}\]
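The two formulas translate almost directly into NumPy. The sketch below is illustrative rather than the library's implementation: the function name `davies_bouldin_sketch` is hypothetical, Euclidean distance is used, and it assumes at least two clusters with distinct centroids so that no $d_{ij}$ is zero.

```python
import numpy as np


def davies_bouldin_sketch(X, labels):
    """Average over clusters of the largest R_ij = (s_i + s_j) / d_ij."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    centroids = np.array([X[labels == q].mean(axis=0) for q in cluster_ids])
    # s_i: mean distance of the points of cluster i to its centroid.
    s = np.array([
        np.linalg.norm(X[labels == q] - centroids[idx], axis=1).mean()
        for idx, q in enumerate(cluster_ids)
    ])
    total = 0.0
    for i in range(k):
        # R_ij for every other cluster j; keep the most similar one.
        r = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(k) if j != i]
        total += max(r)
    return total / k


X_toy = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels_toy = [0, 0, 1, 1]
db = davies_bouldin_sketch(X_toy, labels_toy)
print(round(db, 4))  # 0.2
```

Both clusters have $s_i = 1$ and the centroids are 10 apart, so $R_{01} = 2 / 10$ and the index is 0.2.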

References

Davies, David L.; Bouldin, Donald W. (1979).

“A Cluster Separation Measure”

IEEE Transactions on Pattern Analysis and Machine Intelligence.

PAMI-1 (2): 224-227.

Halkidi, Maria; Batistakis, Yannis; Vazirgiannis, Michalis (2001).

“On Clustering Validation Techniques”

Journal of Intelligent Information Systems, 17(2-3), 107-145.

Wikipedia entry for Davies-Bouldin index.

2.3.11.8. Contingency Matrix¶

Contingency matrix (sklearn.metrics.cluster.contingency_matrix)

reports the intersection cardinality for every true/predicted cluster pair.

The contingency matrix provides sufficient statistics for all clustering

metrics where the samples are independent and identically distributed and

one doesn’t need to account for some instances not being clustered.

Here is an example:

>>> from sklearn.metrics.cluster import contingency_matrix

>>> x = ["a", "a", "a", "b", "b", "b"]

>>> y = [0, 0, 1, 1, 2, 2]

>>> contingency_matrix(x, y)

array([[2, 1, 0],

       [0, 1, 2]])

The first row of output array indicates that there are three samples whose

true cluster is “a”. Of them, two are in predicted cluster 0, one is in 1,

and none is in 2. And the second row indicates that there are three samples

whose true cluster is “b”. Of them, none is in predicted cluster 0, one is in

1 and two are in 2.
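The counting described above can be reproduced in a few lines of plain Python. This is a sketch for illustration only: the function name `contingency_sketch` is hypothetical, it returns nested lists rather than an array, and it orders rows and columns by sorted label value.

```python
def contingency_sketch(labels_true, labels_pred):
    """Count samples for every (true cluster, predicted cluster) pair."""
    classes = sorted(set(labels_true))    # one row per true cluster
    clusters = sorted(set(labels_pred))   # one column per predicted cluster
    row = {c: i for i, c in enumerate(classes)}
    col = {k: j for j, k in enumerate(clusters)}
    table = [[0] * len(clusters) for _ in classes]
    for t, p in zip(labels_true, labels_pred):
        table[row[t]][col[p]] += 1
    return table


x = ["a", "a", "a", "b", "b", "b"]
y = [0, 0, 1, 1, 2, 2]
table = contingency_sketch(x, y)
print(table)  # [[2, 1, 0], [0, 1, 2]]
```

Each row sums to the size of a true cluster and each column to the size of a predicted cluster.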

A confusion matrix for classification is a square contingency matrix where the order of rows and columns corresponds to a list of classes.

2.3.11.8.1. Advantages¶

Allows one to examine the spread of each true cluster across predicted clusters and vice versa.

The contingency table calculated is typically utilized in the calculation

of a similarity statistic (like the others listed in this document) between

the two clusterings.

2.3.11.8.2. Drawbacks¶

The contingency matrix is easy to interpret for a small number of clusters, but becomes very hard to interpret for a large number of clusters.

It doesn’t give a single metric to use as an objective for clustering optimisation.

References

Wikipedia entry for contingency matrix

2.3.11.9. Pair Confusion Matrix¶

The pair confusion matrix

(sklearn.metrics.cluster.pair_confusion_matrix) is a 2x2

similarity matrix

\[\begin{split}C = \left[\begin{matrix}

C_{00} & C_{01} \\

C_{10} & C_{11}

\end{matrix}\right]\end{split}\]

between two clusterings computed by considering all pairs of samples and

counting pairs that are assigned into the same or into different clusters

under the true and predicted clusterings.

It has the following entries:

$C_{00}$ : number of pairs with both clusterings having the samples

not clustered together

$C_{10}$ : number of pairs with the true label clustering having the

samples clustered together but the other clustering not having the samples

clustered together

$C_{01}$ : number of pairs with the true label clustering not having

the samples clustered together but the other clustering having the samples

clustered together

$C_{11}$ : number of pairs with both clusterings having the samples

clustered together

Considering a pair of samples that is clustered together a positive pair,

then as in binary classification the count of true negatives is

$C_{00}$, false negatives is $C_{10}$, true positives is

$C_{11}$ and false positives is $C_{01}$.

Perfectly matching labelings have all their non-zero entries on the diagonal, regardless of actual label values:

>>> from sklearn.metrics.cluster import pair_confusion_matrix

>>> pair_confusion_matrix([0, 0, 1, 1], [0, 0, 1, 1])

array([[8, 0],

       [0, 4]])

>>> pair_confusion_matrix([0, 0, 1, 1], [1, 1, 0, 0])

array([[8, 0],

       [0, 4]])

Labelings that assign all class members to the same clusters are complete but may not always be pure, hence penalized, and have some off-diagonal non-zero entries:

>>> pair_confusion_matrix([0, 0, 1, 2], [0, 0, 1, 1])

array([[8, 2],

       [0, 2]])

The matrix is not symmetric:

>>> pair_confusion_matrix([0, 0, 1, 1], [0, 0, 1, 2])

array([[8, 0],

       [2, 2]])

If class members are completely split across different clusters, the assignment is totally incomplete, hence the matrix has all-zero diagonal entries:

>>> pair_confusion_matrix([0, 0, 0, 0], [0, 1, 2, 3])

array([[ 0,  0],

       [12,  0]])
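The outputs shown above can be reproduced by looping over pairs of samples. The sketch below is an illustration, not the library's implementation, and the function name `pair_confusion_sketch` is hypothetical. It counts ordered pairs, so each unordered pair contributes twice; this matches the totals in the examples, where 4 samples give 4 * 3 = 12 pairs.

```python
from itertools import permutations


def pair_confusion_sketch(labels_true, labels_pred):
    """2x2 table indexed as C[same in true][same in predicted]."""
    C = [[0, 0], [0, 0]]
    # Ordered pairs (i, j) with i != j: every unordered pair is counted twice.
    for i, j in permutations(range(len(labels_true)), 2):
        same_true = int(labels_true[i] == labels_true[j])
        same_pred = int(labels_pred[i] == labels_pred[j])
        C[same_true][same_pred] += 1
    return C


perfect = pair_confusion_sketch([0, 0, 1, 1], [1, 1, 0, 0])
merged = pair_confusion_sketch([0, 0, 1, 2], [0, 0, 1, 1])
split = pair_confusion_sketch([0, 0, 0, 0], [0, 1, 2, 3])
print(perfect)  # [[8, 0], [0, 4]]
print(merged)   # [[8, 2], [0, 2]]
print(split)    # [[0, 0], [12, 0]]
```

Indexing the table by the two boolean comparisons keeps the code aligned with the definitions of $C_{00}$, $C_{01}$, $C_{10}$ and $C_{11}$ given earlier.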

References

“Comparing Partitions”

L. Hubert and P. Arabie, Journal of Classification 1985


