
[1] Liu Linan, Kou Yue, Sun Gaoshang, Shen Derong, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 315-317. [doi:10.3969/j.issn.1003-7985.2008.03.015]

Duplicate identification model for deep web

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
24
Issue:
2008 3
Page:
315-317
Research Field:
Computer Science and Engineering
Publishing date:
2008-09-30

Info

Title:
Duplicate identification model for deep web
Author(s):
Liu Linan, Kou Yue, Sun Gaoshang, Shen Derong, Yu Ge
College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Keywords:
duplicate records; deep web; data cleaning; semi-structured data
PACS:
TP311
DOI:
10.3969/j.issn.1003-7985.2008.03.015
Abstract:
A duplicate identification model is presented for semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the data preprocessing module converts the extracted data into entity records. Then the heterogeneous records processing module calculates the similarity of the entity records and identifies the duplicate records, using the weights calculated in the homogeneous records processing module. Unlike traditional methods, the proposed approach does not require schema matching in advance. Multiple estimators with selective algorithms are adopted to improve matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
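
The weighted similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the estimators, the weighting scheme, or the decision rule, so everything below is an assumption. Records are modeled as plain lists of string values (no aligned schema), the two estimators (`qgram_similarity`, `token_similarity`) are stand-ins, taking the best score over the estimators is one hypothetical selection rule, the weights are supplied by hand rather than learned from homogeneous records, and the `threshold` value is arbitrary.

```python
from typing import Callable, List, Sequence

Estimator = Callable[[str, str], float]


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of padded q-gram sets, in [0, 1]."""
    def grams(text: str) -> set:
        padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
        return {padded[i:i + q] for i in range(len(padded) - q + 1)}

    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def record_similarity(
    rec_a: Sequence[str],
    rec_b: Sequence[str],
    weights: Sequence[float],
    estimators: List[Estimator],
) -> float:
    """Weighted similarity of two records without schema matching.

    Each value of rec_a is compared with every value of rec_b, so no
    attribute alignment is needed. weights[i] is the (assumed, hand-set)
    importance of rec_a[i].
    """
    total = sum(weights)
    if total == 0 or not rec_b:
        return 0.0
    score = 0.0
    for value, weight in zip(rec_a, weights):
        # Hypothetical selection rule: keep the best estimator score
        # against the best-matching value of the other record.
        best = max(est(value, other) for other in rec_b for est in estimators)
        score += weight * best
    return score / total


def is_duplicate(rec_a, rec_b, weights, estimators, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when the similarity reaches the threshold."""
    return record_similarity(rec_a, rec_b, weights, estimators) >= threshold


if __name__ == "__main__":
    estimators = [qgram_similarity, token_similarity]
    weights = [0.6, 0.3, 0.1]  # assumed: title, author, year
    a = ["Duplicate identification model for deep web", "Liu Linan", "2008"]
    b = ["2008", "Liu Linan", "Duplicate identification model for deep web"]
    c = ["Cleaning the spurious links in data", "Lee Mong Li", "2004"]
    print(is_duplicate(a, b, weights, estimators))
    print(is_duplicate(a, c, weights, estimators))
```

Because each value is matched against all values of the other record, the two records in the usage example are compared even though their fields appear in a different order, which is the situation that arises when sources expose different schemas.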

Memo

Memo:
Biographies: Liu Linan (1983-), female, graduate; Shen Derong (corresponding author), female, doctor, professor, shenderong@ise.neu.edu.cn.
Foundation item: The National Natural Science Foundation of China (No. 60673139).
Citation: Liu Linan, Kou Yue, Sun Gaoshang, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 315-317.
Last Update: 2008-09-20