|Table of Contents|

[1] Chong Zhihong, Ni Weiwei, Xu Lizhen, Lü Jianhua, et al. Min-wise hash function-based samplingover distributed data streams [J]. Journal of Southeast University (English Edition), 2009, 25 (4): 456-459. [doi:10.3969/j.issn.1003-7985.2009.04.008]
Copy

Min-wise hash function-based samplingover distributed data streams()
Share:

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volumn:
25
Issue:
2009 4
Page:
456-459
Research Field:
Computer Science and Engineering
Publishing date:
2009-12-30

Info

Title:
Min-wise hash function-based samplingover distributed data streams
Author(s):
Chong Zhihong Ni Weiwei Xu Lizhen Lü Jianhua Xie Yinghao
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Keywords:
data streams aggregation min-wise hashing
PACS:
TP392
DOI:
10.3969/j.issn.1003-7985.2009.04.008
Abstract:
In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.

References:

[1] Gibbons P B, Tirthapura S. Estimating simple functions on the union of data streams [C]//Annual ACM Symposium on Parallel Algorithms and Architectures. Crete Island, Greece, 2001: 281-290.
[2] Ganguly S, Garofalakis M, Rastogi R. Processing set expressions over continuous update streams [C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, CA, USA, 2003: 265-276.
[3] Ganguly S, Garofalakis M, Kumar A, et al. Join-distinct aggregate estimation over update streams [C]//Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Baltimore, Maryland, USA, 2005: 259-270.
[4] Indyk P. Stable distributions, pseudorandom generators, embeddings and data stream computation [C]//Annual Symposium on Foundations of Computer Science Proceedings. New York, NY, USA, 2000: 189-197.
[5] Gilbert A C, Kotidis Y, Muthuukrishnan S, et al. How to summarize the universe: dynamic maintenance of quantiles [C]//Proceedings of the 28th Annual International Conference on Very Large Data Bases. Hong Kong, China, 2002: 454-465.
[6] Broder A Z, Charikar M, Frieze A M, et al. Min-wise independent permutations [C]//Conference Proceedings of the Annual ACM Symposium on Theory of Computing. Dallas, Texas, USA, 1998: 327-336.
[7] Flajolet P, Martin G N. Probabilistic counting algorithms for data base applications [J]. Journal of Computer and System Sciences, 1985, 31(2): 182-209.
[8] Hoeffding W. Probability inequalities for sums of bounded random variables [J]. Journal of the American Statistical Association, 1963, 58(1): 13-30.

Memo

Memo:
Biography: Chong Zhihong(1969—), male, doctor, lecturer, chongzhihong@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China(No.60973023, 60603040), the Natural Science Foundation of Southeast University(No.KJ2009362).
Citation: Chong Zhihong, Ni Weiwei, Xu Lizhen, et al. Min-wise hash function-based sampling over distributed data streams[J]. Journal of Southeast University(English Edition), 2009, 25(4): 456-459.
Last Update: 2009-12-20