[1] Yan Duanwu, Li Xiaopeng, Wang Lei, Cheng Xiao, et al. Ontology-based similarity measure for text clustering [J]. Journal of Southeast University (English Edition), 2006, 22 (3): 389-393. [doi:10.3969/j.issn.1003-7985.2006.03.021]

Ontology-based similarity measure for text clustering

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

2006, Issue 3
Research Field:
Computer Science and Engineering

Ontology-based similarity measure for text clustering
Yan Duanwu (1), Li Xiaopeng (2), Wang Lei (1), Cheng Xiao (1)
(1) Department of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China
(2) Library, Nanjing University of Science and Technology, Nanjing 210094, China
Keywords: similarity measure; text clustering; ontology; information retrieval system
A method that combines category-based and keyword-based concepts to build a better information retrieval system is introduced. To improve document clustering, a document similarity measure is proposed that is based on the cosine vector measure and on keyword frequency in documents, and that also takes an ontology as input. The ontology is domain-specific and includes a list of keywords organized by their degree of importance to the categories of the ontology. By supplying semantic knowledge, the ontology can improve both the document similarity measure and the feedback of information retrieval systems. Two approaches to evaluating the performance of this similarity measure, together with a comparison against the standard cosine vector similarity measure, are also described.
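
The abstract does not give the formula for the proposed measure, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the ontology can be reduced to a mapping from keyword to an importance weight, and that each term frequency is simply multiplied by that weight before the cosine is taken. The names `TOY_ONTOLOGY`, `ontology_weighted`, and `ontology_similarity`, as well as the weight values, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each lowercased whitespace token to its count in the text."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Standard cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def ontology_weighted(tf, keyword_weights, default_weight=1.0):
    """Scale each term frequency by its ontology importance.

    Assumption: terms absent from the ontology keep weight 1.0.
    """
    return {t: f * keyword_weights.get(t, default_weight) for t, f in tf.items()}


# Hypothetical ontology fragment: keyword -> importance weight.
TOY_ONTOLOGY = {"ontology": 3.0, "clustering": 2.0, "retrieval": 2.0}


def ontology_similarity(doc_a, doc_b, keyword_weights=TOY_ONTOLOGY):
    """Cosine similarity over ontology-weighted term-frequency vectors."""
    vec_a = ontology_weighted(term_frequencies(doc_a), keyword_weights)
    vec_b = ontology_weighted(term_frequencies(doc_b), keyword_weights)
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    a = "ontology clustering of text"
    b = "ontology retrieval of images"
    plain = cosine_similarity(term_frequencies(a), term_frequencies(b))
    weighted = ontology_similarity(a, b)
    print(f"plain cosine:      {plain:.3f}")
    print(f"ontology-weighted: {weighted:.3f}")
```

In this toy case the two documents share the domain keyword "ontology", so weighting it more heavily raises the similarity relative to the plain cosine measure. Whether the authors' measure behaves this way depends on details in the full paper.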



Biography: Yan Duanwu (1976-), male, doctor, lecturer, yanwu-nju@163.com.
Last Update: 2006-09-20