The KMeans Algorithm in Machine Learning - CDA Q&A Community


啊啊啊啊啊吖

2019-01-15 Views: 921


The KMeans algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The algorithm requires the number of clusters to be specified in advance. It scales well to large numbers of samples and has been applied in many different fields.
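As a minimal sketch of what this looks like in practice, assuming scikit-learn's `KMeans` estimator is the implementation being described. The toy data, `n_clusters=3`, `n_init=10`, and `random_state=0` are illustrative choices and do not come from the original post:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in blob_centers])

# The number of clusters has to be given up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # one centroid per cluster, shape (3, 2)
print(kmeans.labels_[:10])      # cluster index assigned to the first 10 samples
print(kmeans.inertia_)          # within-cluster sum of squares after fitting
```

The fitted `inertia_` attribute is the quantity the algorithm tries to minimize, so it can be used to compare fits on the same data.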

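The k-means algorithm divides a set of $n$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. These means are commonly called the cluster "centroids". Note that they are not, in general, points chosen from $X$, although they live in the same space. The K-means algorithm aims to choose the centroids that minimize the inertia, or within-cluster sum-of-squares criterion: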

$\sum_{i=1}^{n}\min_{\mu_j \in C}\left(\|x_i - \mu_j\|^2\right)$
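To make the criterion concrete, here is a small sketch in plain Python that evaluates it directly: for each sample, take the squared Euclidean distance to its nearest centroid, then sum over all samples. The helper names `squared_distance` and `inertia` and the toy data are hypothetical and chosen only for illustration:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def inertia(samples, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(x, mu) for mu in centroids) for x in samples)


# Hypothetical toy data: four 2-D samples and two candidate centroids.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Each sample lies exactly 1 unit from its nearest centroid, so the sum is 4.0.
print(inertia(samples, centroids))  # → 4.0
```

Moving either centroid away from the mean of its two samples increases the result, which is why the cluster means are the natural choice of centroids.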

Inertia, or the within-cluster sum-of-squares criterion, can be regarded as a measure of how internally coherent the clusters are. It has several drawbacks:

Inertia assumes that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters and to manifolds with irregular shapes.
Inertia is not a normalized metric: we only know that lower values are better and that zero is optimal. In very high-dimensional spaces, however, Euclidean distances tend to become inflated (an instance of the so-called "curse of dimensionality"). Running a dimensionality reduction algorithm such as PCA before k-means clustering can alleviate this problem and speed up the computation.
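The PCA suggestion above can be sketched as a two-step pipeline, assuming scikit-learn's `PCA`, `KMeans`, and `make_pipeline`. The synthetic data, the choice of 2 components, and the 2 clusters are assumptions made for illustration and are not from the original post:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical data: 200 samples in 50 dimensions, where only the first
# two coordinates carry cluster structure and the rest are pure noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
X[:100, :2] += 8.0  # shift half of the samples to form a second group

# Reduce to 2 components first, then cluster in the reduced space.
model = make_pipeline(
    PCA(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = model.fit_predict(X)

print(labels[:5], labels[-5:])
print(model.named_steps["kmeans"].inertia_)  # inertia measured in the reduced space
```

Note that the inertia reported here is computed in the 2-D projected space, so it is not directly comparable to the inertia of a fit on the original 50-D data.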

http://sklearn.apachecn.org/cn/0.19.0/_images/sphx_glr_plot_kmeans_assumptions_0011.png
