
[1] Xia Shixiong, Li Wenchao, Zhou Yong, Zhang Lei, et al. Improved k-means clustering algorithm [J]. Journal of Southeast University (English Edition), 2007, 23 (3): 435-438. [doi:10.3969/j.issn.1003-7985.2007.03.027]

Improved k-means clustering algorithm

Chinese title: 一种改进的k-means聚类算法


Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:: 23
Issue:: 3 (2007)

Page:: 435-438

Research Field:: Automation

Publishing date:: 2007-09-30

Info

Title:: Improved k-means clustering algorithm


Author(s):: Xia Shixiong; Li Wenchao; Zhou Yong; Zhang Lei; Niu Qiang (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221008, China)

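Author(s) (Chinese):: 夏士雄; 李文超; 周勇; 张磊; 牛强 (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221008, China)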

Keywords:: clustering; k-means algorithm; silhouette coefficient


CLC number:: TP18

DOI:: 10.3969/j.issn.1003-7985.2007.03.027

Abstract:: The k-means algorithm has two drawbacks: the number of clusters in the data set must be known in advance, and the result is sensitive to the choice of initial cluster centers. To address both, an improved k-means clustering algorithm is proposed. First, the silhouette coefficient is introduced, and the optimal number of clusters K_opt for a data set with unknown class information is determined by computing the silhouette coefficients of the objects in the clusters under different values of K. Then the distribution of the data set is obtained through agglomerative hierarchical clustering, and the initial cluster centers are determined from it. Finally, the clustering is completed with the traditional k-means algorithm. Theoretical analysis shows that the improved algorithm has moderate computational complexity. Experimental results on the IRIS test data set show that the algorithm separates different clusters reasonably, identifies outliers effectively, and produces clusters with lower entropy.
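The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the linkage criterion, how the clusterings for each K are produced, or how outliers are handled. The sketch assumes centroid linkage for the hierarchical step and assumes that each candidate K is scored by running the whole pipeline and taking the mean silhouette coefficient. All function names (`hierarchical_centers`, `kmeans`, `mean_silhouette`, `choose_k`) and the toy data are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def hierarchical_centers(data, k):
    """Agglomerative step: merge the two clusters with the closest
    centroids until k remain, then return the cluster centroids.
    Centroid linkage is an assumption, not taken from the paper."""
    clusters = [[p] for p in data]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)          # j > i, so index i stays valid
        clusters[i] = clusters[i] + merged
    return [centroid(c) for c in clusters]

def kmeans(data, centers, max_iter=100):
    """Traditional k-means (Lloyd iterations) from given initial centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                      for p in data]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(data, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

def mean_silhouette(data, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all objects, where a is the
    mean distance to the object's own cluster and b is the smallest mean
    distance to any other cluster. Singletons contribute s(i) = 0."""
    present = set(labels)
    if len(present) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(data):
        own = [dist(p, q) for j, q in enumerate(data)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)
        b = min(sum(dist(p, q) for q, l in zip(data, labels) if l == c)
                / labels.count(c)
                for c in present if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(data)

def choose_k(data, k_max):
    """Try K = 2..k_max and keep the K with the highest mean silhouette."""
    best = None
    for k in range(2, k_max + 1):
        labels, centers = kmeans(data, hierarchical_centers(data, k))
        score = mean_silhouette(data, labels)
        if best is None or score > best[0]:
            best = (score, k, labels, centers)
    return best

# Toy data (hypothetical): three well-separated 2x2 squares of points.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (0, 10), (0, 11), (1, 10), (1, 11)]
score, k_opt, labels, centers = choose_k(data, 5)
print(k_opt, round(score, 3))
```

On this toy data the three squares are far apart, so the mean silhouette should peak at K = 3. The hierarchical step is deterministic, which removes the dependence on random initial centers, at the cost of a pairwise merge search that is far more expensive than k-means itself in this naive form.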



Memo

Memo:: Biography: Xia Shixiong (1961-), male, professor, xiasx@cumt.edu.cn.

Last Update: 2007-09-20
