A practical application
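The following example (implemented in R; the full code is in the attachment) shows how to use kmeans and brings together everything covered above.
Load the experimental dataset iris. This dataset is used frequently in machine learning. It classifies iris species mainly by the sizes of several parts of the flower. The experiment uses the fpc library to estimate the silhouette coefficient. If the library is not installed, install it with install.packages.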
# install.packages("fpc")
library(fpc)
library(datasets)
# names(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
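# Min-max (0-1) normalization
min.max.norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
raw.data <- iris[, 1:4]
# Normalize each column and use the short column names shown in the output
norm.data <- setNames(as.data.frame(lapply(raw.data, min.max.norm)), c("sl", "sw", "pl", "pw"))
head(norm.data)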
##           sl        sw         pl         pw
## 1 0.22222222 0.6250000 0.06779661 0.04166667
## 2 0.16666667 0.4166667 0.06779661 0.04166667
## 3 0.11111111 0.5000000 0.05084746 0.04166667
## 4 0.08333333 0.4583333 0.08474576 0.04166667
## 5 0.19444444 0.6666667 0.06779661 0.04166667
## 6 0.30555556 0.7916667 0.11864407 0.12500000
The four features of iris are normalized; each feature is the size of one part of the flower.
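# Evaluate k from 2 to 8
K <- 2:8
rounds <- 30  # repeat each k 30 times to smooth out random initialization
rst <- sapply(K, function(i) {
  print(paste("K=", i))
  mean(sapply(1:rounds, function(r) {
    print(paste("Round", r))
    result <- kmeans(norm.data, i)
    stats <- cluster.stats(dist(norm.data), result$cluster)
    stats$avg.silwidth
  }))
})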
## [1] "K= 2"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 3"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 4"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 5"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 6"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 7"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
## [1] "K= 8"
## [1] "Round 1"
## [1] "Round 2"
## [1] "Round 3"
## [1] "Round 4"
## [1] "Round 5"
## [1] "Round 6"
## [1] "Round 7"
## [1] "Round 8"
## [1] "Round 9"
## [1] "Round 10"
## [1] "Round 11"
## [1] "Round 12"
## [1] "Round 13"
## [1] "Round 14"
## [1] "Round 15"
## [1] "Round 16"
## [1] "Round 17"
## [1] "Round 18"
## [1] "Round 19"
## [1] "Round 20"
## [1] "Round 21"
## [1] "Round 22"
## [1] "Round 23"
## [1] "Round 24"
## [1] "Round 25"
## [1] "Round 26"
## [1] "Round 27"
## [1] "Round 28"
## [1] "Round 29"
## [1] "Round 30"
plot(K, rst, type='l', main='Silhouette coefficient vs. K', ylab='Silhouette coefficient')
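To evaluate k: K is usually not very large, and a large K is also hard to interpret, so K is swept from 2 to 8. Because kmeans has some randomness and does not converge to the global minimum on every run, each value of k is run 30 times, the silhouette coefficient is computed for each run, and the average is used as the final evaluation score, which gives the plot shown above.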
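For reference, the average silhouette width that the loop above reads from avg.silwidth can be reproduced with base R. The sketch below is an illustration under assumptions, not the fpc implementation: the helper name avg.silhouette is hypothetical, singleton clusters are given a width of 0 by convention, and at least two clusters are assumed.

```r
# Hypothetical helper (not part of fpc): average silhouette width
# from a distance object and a vector of cluster labels.
avg.silhouette <- function(d, cluster) {
  d <- as.matrix(d)
  labels <- unique(cluster)
  widths <- sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)              # convention for singleton clusters
    # a: mean distance to the other members of the point's own cluster
    # (d[i, i] is 0, so dividing by n - 1 excludes the point itself)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    others <- labels[labels != cluster[i]]
    b <- min(sapply(others, function(l) mean(d[i, cluster == l])))
    (b - a) / max(a, b)
  })
  mean(widths)
}

# Min-max normalized iris features, rebuilt here so the sketch is self-contained
x <- as.data.frame(lapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v))))
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
fit <- kmeans(x, centers = 2)
sil <- avg.silhouette(dist(x), fit$cluster)
print(sil)
```

A value near 1 means points sit much closer to their own cluster than to the nearest other cluster; values near 0 or below indicate overlapping or misassigned clusters.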
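The silhouette coefficient is largest at k = 2, even though there are actually 3 species. After clustering, the original data are 4-dimensional and cannot be visualized directly, so multidimensional scaling is used to reduce them to 2 dimensions and inspect the clustering result.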
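# Reduce dimensionality for inspection
old.par <- par(mfrow = c(1, 2))  # two plots side by side
k <- 2  # best k according to the evaluation above
clu <- kmeans(norm.data, k)
mds <- cmdscale(dist(norm.data, method = "euclidean"))  # classical MDS to 2 dimensions
plot(mds, col=clu$cluster, main='kmeans clustering, k=2', pch = 19)
plot(mds, col=iris$Species, main='Original classes', pch = 19)
par(old.par)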
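The cluster on the left matches the original classification well. On the right, the original data run together and kmeans cannot separate them well, so another method is needed.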
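The visual comparison can also be checked numerically by cross-tabulating the species against the cluster labels. This is a minimal sketch, not part of the original code: the seed is arbitrary, and the normalized data are rebuilt so the block runs on its own.

```r
# Min-max normalized iris features, rebuilt so the sketch is self-contained
norm.data <- as.data.frame(lapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v))))

set.seed(42)  # arbitrary seed, only so the sketch is repeatable
clu <- kmeans(norm.data, centers = 2)

# Rows are the true species, columns are the kmeans cluster labels
confusion <- table(species = iris$Species, cluster = clu$cluster)
print(confusion)
```

A species whose row falls almost entirely in one column is well captured by a single cluster; species that share a column are the ones kmeans fails to separate.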
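kmeans best practices
Randomly choose k points from the training data as the starting points.
Once k is chosen, run the algorithm n times from random starts and keep the run with the smallest cost function value as the final clustering, to avoid local optima caused by the randomness.
Elbow method for choosing k: draw a scatter plot of the cost function against k and set k at the point where there is a clear bend (as shown below). This can be combined with the silhouette coefficient.
Sometimes k has to be chosen according to the application scenario rather than purely from evaluation metrics.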
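The restart and elbow recommendations above can be sketched together in a few lines. The sketch assumes the cost function is the total within-cluster sum of squares, which kmeans returns as tot.withinss; the choice of 30 restarts and the range 1 to 8 are illustrative, not prescribed.

```r
# Min-max normalized iris features, rebuilt so the sketch is self-contained
norm.data <- as.data.frame(lapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v))))

set.seed(42)  # arbitrary seed, only so the sketch is repeatable
K <- 1:8

# nstart = 30 makes kmeans try 30 random sets of starting points and
# keep the solution with the lowest total within-cluster sum of squares
wss <- sapply(K, function(k) kmeans(norm.data, centers = k, nstart = 30)$tot.withinss)

# Elbow plot: look for the k after which the curve flattens out
plot(K, wss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum of squares", main = "Elbow plot")
```

Using nstart replaces a hand-written restart loop, since kmeans keeps the best of the restarts internally.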