
[1] Liu Linan, Kou Yue, Sun Gaoshang, Shen Derong, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 315-317. [doi:10.3969/j.issn.1003-7985.2008.03.015]

Duplicate identification model for deep web


Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:: 24
Issue:: 2008, No. 3

Page:: 315-317

Research Field:: Computer Science and Engineering

Publishing date:: 2008-09-30

Info

Title:: Duplicate identification model for deep web

Author(s):: Liu Linan; Kou Yue; Sun Gaoshang; Shen Derong; Yu Ge (College of Information Science and Engineering, Northeastern University, Shenyang 110004, China)

Keywords:: duplicate records; deep web; data cleaning; semi-structured data

PACS:: TP311

DOI:: 10.3969/j.issn.1003-7985.2008.03.015

Abstract:: A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the data preprocessing module converts the extracted data into entity records. Then the heterogeneous records processing module calculates the similarity degree of the entity records, using the weights calculated in the homogeneous records processing module, to obtain the duplicate records. Unlike traditional methods, the proposed approach does not require schema matching in advance. Multiple estimators with selective algorithms are adopted to improve matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
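
The weighted, schema-free comparison outlined in the abstract might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the estimators, how the weights are derived, or a decision threshold. Here the string estimator is assumed to be a q-gram Jaccard overlap, the weights are supplied by the caller, and the names `qgrams`, `string_similarity`, `record_similarity`, `is_duplicate`, and the `threshold` value are all hypothetical.

```python
def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def string_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets, in [0, 1] (assumed estimator)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def record_similarity(rec_a, rec_b, weights):
    """Weighted similarity of two records given as plain lists of field values.

    No schema matching is done: each value in rec_a is paired with its
    best-matching value anywhere in rec_b. weights[i] is the importance of
    rec_a[i]; in this sketch the weights are simply passed in (hypothetical),
    whereas the paper computes them in a separate module.
    """
    total = sum(weights)
    if total == 0 or not rec_b:
        return 0.0
    score = 0.0
    for value, weight in zip(rec_a, weights):
        best = max(string_similarity(value, other) for other in rec_b)
        score += weight * best
    return score / total


def is_duplicate(rec_a, rec_b, weights, threshold=0.8):
    """Flag a pair as duplicates when the weighted score reaches the threshold."""
    return record_similarity(rec_a, rec_b, weights) >= threshold


if __name__ == "__main__":
    weights = [0.6, 0.3, 0.1]  # hypothetical field weights
    a = ["Data Cleaning Basics", "J. Smith", "2008"]
    b = ["2008", "data cleaning basics", "J. Smith"]   # same values, other order
    c = ["Cooking for Beginners", "A. Jones", "1999"]  # unrelated record
    print(is_duplicate(a, b, weights), is_duplicate(a, c, weights))
```

Pairing each value with its best match in the other record is one simple way to compare records whose fields are not aligned; it is a stand-in chosen for brevity, and a real system would likely combine several estimators as the abstract indicates.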


Memo

Memo:: Biographies: Liu Linan (1983-), female, graduate; Shen Derong (corresponding author), female, doctor, professor, shenderong@ise.neu.edu.cn.
Foundation item: The National Natural Science Foundation of China (No. 60673139).
Citation: Liu Linan, Kou Yue, Sun Gaoshang, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24(3): 315-317.

Last Update: 2008-09-20
