
[1] Qiu Yong, Lan Yongjie. Algorithms of mining data records from website automatically [J]. Journal of Southeast University (English Edition), 2006, 22 (3): 423-425. [doi:10.3969/j.issn.1003-7985.2006.03.028]

Algorithms of mining data records from website automatically

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
22
Issue:
2006, No. 3
Page:
423-425
Research Field:
Computer Science and Engineering
Publishing date:
2006-09-30

Info

Title:
Algorithms of mining data records from website automatically
Author(s):
Qiu Yong, Lan Yongjie
School of Information and Electronic Engineering, Shandong Institute of Business and Technology, Yantai 264005, China
Keywords:
data mining; data record; website; isomorphic page
PACS:
TP311
DOI:
10.3969/j.issn.1003-7985.2006.03.028
Abstract:
To improve the accuracy and completeness of mining data records from the web, the concepts of isomorphic page and directory page are introduced and three algorithms are proposed. Isomorphic web pages are a set of web pages that share a uniform structure and differ only in their main information. A web page that contains many links to isomorphic web pages is called a directory page. Algorithm 1 finds directory pages in a website by analyzing the similarity of adjacent links: it first sorts the links and then counts the links in each directory; if the count is greater than a given threshold, it finds the similar sub-page links in that directory and returns the results. A function for judging whether web pages are isomorphic is also proposed. Algorithm 2 mines data records from an isomorphic page using a noise information filter, based on the fact that the noise information is the same in two isomorphic pages and only the main information differs. Algorithm 3 mines data records from an entire website using spider (crawler) technology. The experiment shows that the proposed algorithms mine data records more completely than existing algorithms. Mining data records from isomorphic pages is an efficient method.
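
The abstract gives only an outline of Algorithms 1 and 2, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It assumes that a "directory" is the URL path prefix of a link and that pages are compared line by line; the function names `find_directory_candidates` and `filter_noise` and the sample URLs are hypothetical. The paper's similarity analysis and isomorphism judgment function are not described in the abstract and are not reproduced here.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath


def find_directory_candidates(links, threshold=3):
    """Sort links, count them per URL directory, and keep the directories
    whose link count is greater than `threshold` (assumed grouping rule)."""
    by_dir = defaultdict(list)
    for url in sorted(set(links)):
        parsed = urlparse(url)
        # Assumption: the directory is host + parent path of the link.
        directory = parsed.netloc + posixpath.dirname(parsed.path)
        by_dir[directory].append(url)
    return {d: urls for d, urls in by_dir.items() if len(urls) > threshold}


def filter_noise(page_lines, sibling_lines):
    """Drop every line that also occurs in an isomorphic sibling page.
    Shared lines are treated as noise; the remainder is the main information."""
    noise = {line.strip() for line in sibling_lines}
    kept = []
    for line in page_lines:
        text = line.strip()
        if text and text not in noise:
            kept.append(text)
    return kept


if __name__ == "__main__":
    links = [
        "http://example.com/news/1.html",
        "http://example.com/news/2.html",
        "http://example.com/news/3.html",
        "http://example.com/about.html",
    ]
    print(find_directory_candidates(links, threshold=2))

    page_a = ["Site header", "Record A: price 10", "Footer"]
    page_b = ["Site header", "Record B: price 20", "Footer"]
    print(filter_noise(page_a, page_b))
```

A line-level comparison is the simplest way to show the idea that noise is identical across isomorphic pages; a real extractor would more likely compare parsed HTML structure.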


Memo

Memo:
Biography: Qiu Yong (1959-), male, professor, sdytqy@163.com.
Last Update: 2006-09-20