[1] Li Huayang, Liu Yubao, Li Youkui. Application of fuzzy equivalence theory in data cleaning [J]. Journal of Southeast University (English Edition), 2004, 20 (4): 454-457. [doi:10.3969/j.issn.1003-7985.2004.04.012]

Application of fuzzy equivalence theory in data cleaning

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
20
Issue:
2004 (4)
Page:
454-457
Research Field:
Computer Science and Engineering
Publishing date:
2004-12-30

Info

Title:
Application of fuzzy equivalence theory in data cleaning (Chinese title: 模糊等值理论在数据清理中的应用)
Author(s):
Li Huayang (李华旸) 1,2; Liu Yubao (刘玉葆) 1; Li Youkui (李又奎) 3
1 College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
2 UFsoft School of Software, Jiangxi University of Finance and Economics, Nanchang 330013, China
3 Nanjing Institute, Huawei Technologies Co., Ltd, Nanjing 210001, China
Keywords:
equivalence theory; equivalence degree; data cleaning
PACS:
T391
DOI:
10.3969/j.issn.1003-7985.2004.04.012
Abstract:
This paper presents a rule merging and simplification method and an improved deviation analysis algorithm for data cleaning. Fuzzy equivalence theory avoids the rigid either-or judgments of traditional equivalence theory. During a data cleaning task, however, some rules may stand in an “including”/“included” relation to one another. The equivalence degree of an included rule is smaller than that of the rule that includes it, so a rule merging and simplification method, driven by a user-defined threshold, is introduced to reduce the total computing time for duplicate records. The same relation also affects the deviation analysis of the fuzzy equivalence degree, so an improved deviation analysis algorithm that ignores the equivalence degrees of included rules is presented as well. Normally the duplicate records are logged in a file, and users have to check and verify them one by one, which is time-consuming and error-prone. The proposed clustering algorithm reduces the manual effort of checking duplicate records. Finally, an experiment demonstrates the feasibility of the rule optimization.
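
The rule merging idea in the abstract can be illustrated with a minimal sketch. The record does not say how rules are represented or how the equivalence degree is computed, so everything here is an assumption made for illustration: a rule is modelled as the set of field names it compares, a rule counts as "included" when its field set is a proper subset of another rule's field set, and `is_included` and `merge_rules` are hypothetical helper names. The sketch only shows the pruning step (dropping included rules before any further computation); it does not implement the fuzzy equivalence degree, the user threshold, or the paper's actual algorithm.

```python
from typing import FrozenSet, List

# Hypothetical model: a rule is the set of record fields it compares.
Rule = FrozenSet[str]


def is_included(inner: Rule, outer: Rule) -> bool:
    """True if `inner` is a proper subset of `outer` (assumed meaning of "included")."""
    return inner < outer


def merge_rules(rules: List[Rule]) -> List[Rule]:
    """Drop duplicate rules and every rule included in another rule, keeping input order."""
    unique = list(dict.fromkeys(rules))
    return [r for r in unique if not any(is_included(r, other) for other in unique)]


rules = [
    frozenset({"name"}),
    frozenset({"name", "address"}),
    frozenset({"phone"}),
    frozenset({"name", "address", "phone"}),
]
merged = merge_rules(rules)  # only the rule covering all three fields survives
```

Under this assumed model, fewer rules remain to be evaluated per record pair, which is the kind of saving in computing time the abstract describes.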

Memo

Memo:
Biography: Li Huayang (born 1973), male, doctor, associate professor, dariusli@tom.com.
Last Update: 2004-12-20