
[1] Luo Na, Zuo Wanli, Yuan Fuyu, et al. Using ontology semantics to improve text documents clustering [J]. Journal of Southeast University (English Edition), 2006, 22 (3): 370-374. [doi:10.3969/j.issn.1003-7985.2006.03.017]

Using ontology semantics to improve text documents clustering

Using ontology semantics to improve text clustering


Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:: 22
Issue:: 2006, No. 3

Page:: 370-374

Research Field:: Automation

Publishing date:: 2006-09-30

Info

Title:: Using ontology semantics to improve text documents clustering

Title (from the Chinese):: Using ontology semantics to improve text clustering

Author(s):: Luo Na¹,²; Zuo Wanli¹; Yuan Fuyu¹; Zhang Jingbo²; Zhang Huijie²
¹ College of Computer Science and Technology, Jilin University, Changchun 130012, China
² School of Computer Science, Northeast Normal University, Changchun 130024, China

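Author(s) (from the Chinese):: Luo Na¹,²; Zuo Wanli¹; Yuan Fuyu¹; Zhang Jingbo²; Zhang Huijie²; ¹ College of Computer Science and Technology, Jilin University, Changchun 130012; ² School of Computer Science, Northeast Normal University, Changchun 130024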

Keywords:: ontology; text clustering; lexicon; WordNet

Keywords (from the Chinese):: ontology; text clustering; lexicon; WordNet

CLC number:: TP181

DOI:: 10.3969/j.issn.1003-7985.2006.03.017

Abstract:: To improve clustering results and allow selection among them, ontology semantics are combined with document clustering. A new document clustering algorithm based on WordNet is proposed for the document processing phase. First, after the documents are represented by tf-idf, every word vector is extended with new entities. Then a feature extraction algorithm is applied to the documents. Finally, the ontology aggregation clustering (OAC) algorithm is proposed to improve the result of document clustering. Experiments are based on the Reuters 20 News Group data set, and the experimental results are compared using mutual information (MI). The results show that the proposed ontology-based document clustering algorithm performs better than existing clustering algorithms such as MNB, CLUTO, and co-clustering.

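Abstract (from the Chinese):: To improve clustering results and allow selection among them, ontology semantics are combined with document clustering, and a new WordNet-based document clustering algorithm is proposed for the document processing stage. First, the documents are represented by tf-idf, and each word vector is extended with new entities so that WordNet concepts appear in the document collection. Second, a feature extraction algorithm is applied to the documents. Finally, an ontology aggregation clustering algorithm is proposed to improve text clustering. The experiments are built on the Reuters 20 News Group data, with mutual information used to compare the experimental results. The results show that, compared with existing algorithms such as MNB, CLUTO, and co-clustering, the ontology-based clustering algorithm gives a clear improvement in text clustering.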
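
The first and last steps named in the abstract (tf-idf representation with concept extension, and comparison by mutual information) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature extraction step or the OAC algorithm, so neither is reproduced here. `TOY_HYPERNYMS` is a hypothetical stand-in for a WordNet lookup, the `concept:` prefix is an assumed naming convention, and the tf-idf and mutual information formulas are the standard textbook forms, which may differ from the ones used in the paper.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet hypernym lookup (not real WordNet data).
TOY_HYPERNYMS = {
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def extend_with_concepts(tokens, hypernyms=TOY_HYPERNYMS):
    """Append a concept entry for every token that has a known hypernym."""
    concepts = ["concept:" + hypernyms[t] for t in tokens if t in hypernyms]
    return tokens + concepts


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    return sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


docs = [["dog", "barks"], ["cat", "meows"], ["car", "drives"]]
plain = tfidf_vectors(docs)
extended = tfidf_vectors([extend_with_concepts(d) for d in docs])
```

In this toy corpus the first two documents share no words, so their plain tf-idf vectors have zero cosine similarity; after extension both contain `concept:animal`, so `cosine(extended[0], extended[1])` becomes positive while the third document stays dissimilar. `mutual_information` can then score a clustering against reference labels, with higher values indicating closer agreement.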


Memo

Memo:: Biographies: Luo Na (1980-), female, graduate; Zuo Wanli (corresponding author), male, doctor, professor, wanli@jlu.edu.cn.

Last Update: 2006-09-20
