|Table of Contents|

[1] Chong Zhihong, Ni Weiwei, Xu Lizhen, Lü Jianhua, et al. Min-wise hash function-based samplingover distributed data streams [J]. Journal of Southeast University (English Edition), 2009, 25 (4): 456-459. [doi:10.3969/j.issn.1003-7985.2009.04.008]
Copy

Min-wise hash function-based samplingover distributed data streams()
分布数据流上基于Min-wise散列函数的采样
Share:

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volumn:
25
Issue:
2009 4
Page:
456-459
Research Field:
Computer Science and Engineering
Publishing date:
2009-12-30

Info

Title:
Min-wise hash function-based samplingover distributed data streams
分布数据流上基于Min-wise散列函数的采样
Author(s):
Chong Zhihong Ni Weiwei Xu Lizhen Lü Jianhua Xie Yinghao
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
崇志宏 倪巍伟 徐立臻 吕建华 谢英豪
东南大学计算机科学与工程学院, 南京 210096
Keywords:
data streams aggregation min-wise hashing
数据流 聚集 Min-wise散列
PACS:
TP392
DOI:
10.3969/j.issn.1003-7985.2009.04.008
Abstract:
In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.
针对分布数据流中存在的冗余和不一致信息问题, 提出了一种基于Min-wise散列的采样方法, 并定义了反映应用需求的分布数据流并的语义. 首先, 对于每一族Min-wise散列函数选取具有最小散列值的数据作为局部样本, 滤除单个数据流中的频繁更新对采样偏斜的影响. 然后, 对于相同散列函数产生的样本选取具有最小散列值的样本作为全局样本, 完成局部样本集在中心节点的合并, 滤除在分布节点上的重复更新对样本偏斜的影响. 最后, 利用获得的均匀样本集, 在多种数据流并的语义上精确估计聚集函数的值. 基于人造数据和真实数据的对比试验验证了该方法的有效性.

References:

[1] Gibbons P B, Tirthapura S. Estimating simple functions on the union of data streams [C]//Annual ACM Symposium on Parallel Algorithms and Architectures. Crete Island, Greece, 2001: 281-290.
[2] Ganguly S, Garofalakis M, Rastogi R. Processing set expressions over continuous update streams [C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA, USA, 2003: 265-276.
[3] Ganguly S, Garofalakis M, Kumar A, et al. Join-distinct aggregate estimation over update streams [C]//Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Baltimore, Maryland, USA, 2005: 259-270.
[4] Indyk P. Stable distributions, pseudorandom generators, embeddings and data stream computation [C]//Annual Symposium on Foundations of Computer Science Proceedings. New York, NY, USA, 2000: 189-197.
[5] Gilbert A C, Kotidis Y, Muthuukrishnan S, et al. How to summarize the universe: dynamic maintenance of quantiles [C]//Proceedings of the 28th Annual International Conference on Very Large Data Bases. Hong Kong, China, 2002: 454-465.
[6] Broder A Z, Charikar M, Frieze A M, et al. Min-wise independent permutations [C]//Conference Proceedings of the Annual ACM Symposium on Theory of Computing. Dallas, Texas, USA, 1998: 327-336.
[7] Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications [J]. Journal of Computer and System Sciences, 1985, 31(2): 182-209.
[8] Hoeffding W. Probability inequalities for sums of bounded random variables [J]. Journal of the American Statistical Association, 1963, 58(1): 13-30.

Memo

Memo:
Biography: Chong Zhihong(1969—), male, doctor, lecturer, chongzhihong@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China(No.60973023, 60603040), the Natural Science Foundation of Southeast University(No.KJ2009362).
Citation: Chong Zhihong, Ni Weiwei, Xu Lizhen, et al. Min-wise hash function-based sampling over distributed data streams[J]. Journal of Southeast University(English Edition), 2009, 25(4): 456-459.
Last Update: 2009-12-20