
[1] Liu Linan, Kou Yue, Sun Gaoshang, Shen Derong, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 315-317. [doi:10.3969/j.issn.1003-7985.2008.03.015]

Duplicate identification model for deep web
Chinese title (translated): A duplicate record identification model for deep web data sources

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
24
Issue:
3 (2008)
Page:
315-317
Research Field:
Computer Science and Engineering
Publishing date:
2008-09-30

Info

Title:
Duplicate identification model for deep web
Chinese title (translated): A duplicate record identification model for deep web data sources
Author(s):
Liu Linan (刘丽楠), Kou Yue (寇月), Sun Gaoshang (孙高尚), Shen Derong (申德荣), Yu Ge (于戈)
College of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Keywords:
duplicate records; deep web; data cleaning; semi-structured data
CLC number:
TP311
DOI:
10.3969/j.issn.1003-7985.2008.03.015
Abstract:
A duplicate identification model is presented for semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the data preprocessing module converts the extracted data into entity records. Then the heterogeneous records processing module calculates the similarity of the entity records, using the weights calculated in the homogeneous records processing module, to identify the duplicate records. Unlike traditional methods, the proposed approach does not require schema matching in advance. Multiple similarity estimators with selectable algorithms are adopted to improve matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
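Chinese abstract (translated): The duplicate record identification model for deep web data sources processes semi-structured and unstructured data extracted from multiple deep web data sources. First, the data preprocessing module converts the extracted data into entity records. Then the heterogeneous records processing module uses the weights obtained in the homogeneous records processing module to calculate the similarity of the entity records and obtain the duplicate records. Unlike traditional duplicate record identification models, the proposed method works without the schema matching being known in advance, and it adopts multiple similarity estimators with selectable algorithms to achieve better matching efficiency. Experiments show that the duplicate record identification model is feasible and effective.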
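
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the weights are derived or which estimators are used, so everything here is an assumption made for illustration. In particular, attribute_weights (weights taken from how distinct an attribute's values are within one source), the two estimators (q-gram overlap and difflib's sequence ratio), the sample records, and DUPLICATE_THRESHOLD are all hypothetical. The sketch only shows how records from sources with different schemas can be compared value by value, without aligning attribute names first.

```python
from difflib import SequenceMatcher


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def sequence_similarity(a, b):
    """difflib's similarity ratio, used as a second interchangeable estimator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Assumption: any function (str, str) -> [0, 1] can be plugged in here.
ESTIMATORS = [qgram_similarity, sequence_similarity]


def attribute_weights(records):
    """Hypothetical weighting over records of ONE source (same schema):
    attributes with more distinct values get a larger, normalized weight."""
    distinct = {}
    for record in records:
        for attr, value in record.items():
            distinct.setdefault(attr, set()).add(value.lower())
    raw = {attr: len(vals) / len(records) for attr, vals in distinct.items()}
    total = sum(raw.values())
    return {attr: score / total for attr, score in raw.items()}


def record_similarity(rec_a, weights_a, rec_b, estimators):
    """Weighted similarity of two records from different sources.

    No schema matching: each value of rec_a is compared with every value
    of rec_b, and the best score over all estimators is kept.
    """
    score = 0.0
    for attr, value_a in rec_a.items():
        best = max(
            (est(value_a, value_b)
             for value_b in rec_b.values()
             for est in estimators),
            default=0.0,
        )
        score += weights_a.get(attr, 0.0) * best
    return score


# Hypothetical entity records, as a preprocessing step might produce them.
source_a = [
    {"title": "Data Cleaning Basics", "author": "A. Smith", "year": "2004"},
    {"title": "Web Databases", "author": "B. Jones", "year": "2004"},
]
rec_b = {"name": "Data cleaning basics", "writer": "A. Smith", "published": "2004"}
rec_c = {"name": "Query Optimization", "writer": "C. Wu", "published": "1999"}

DUPLICATE_THRESHOLD = 0.8  # assumed cut-off, not taken from the paper

weights = attribute_weights(source_a)
for candidate in (rec_b, rec_c):
    score = record_similarity(source_a[0], weights, candidate, ESTIMATORS)
    print(round(score, 3), score >= DUPLICATE_THRESHOLD)
```

Taking the maximum over estimators is one simple way to let several estimators cooperate; a real system might instead select an estimator per data type or combine their scores.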


Memo

Memo:
Biographies: Liu Linan (1983-), female, graduate; Shen Derong (corresponding author), female, doctor, professor, shenderong@ise.neu.edu.cn.
Foundation item: The National Natural Science Foundation of China (No. 60673139).
Citation: Liu Linan, Kou Yue, Sun Gaoshang, et al. Duplicate identification model for deep web [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 315-317.
Last Update: 2008-09-20