[1] Shen Yang, Zhu Chanyuan, Li Shuchen. System of twice-gathering information and research of information fingerprint HashTrie [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 381-384. [doi:10.3969/j.issn.1003-7985.2008.03.032]

System of twice-gathering information and research of information fingerprint HashTrie

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
24
Issue:
3 (2008)
Page:
381-384
Research Field:
Computer Science and Engineering
Publishing date:
2008-09-30

Info

Title:
System of twice-gathering information and research of information fingerprint HashTrie
Author(s):
Shen Yang (1), Zhu Chanyuan (2), Li Shuchen (3)
(1) School of Information Management, Wuhan University, Wuhan 430072, China
(2) School of Computer Science, Wuhan University, Wuhan 430072, China
(3) International School of Software, Wuhan University, Wuhan 430072, China
Keywords:
physical isolation; twice-gathering; duplicated web pages elimination; information fingerprint; HashTrie
PACS:
TP393.09
DOI:
10.3969/j.issn.1003-7985.2008.03.032
Abstract:
This paper presents a prototype of a twice-gathering information interaction system for e-government, designed for the condition that the Intranet and the Extranet are physically isolated. Users in the Extranet use client software to gather links to the latest related information, which has previously been collected from the Internet by web alert. Finally, the information is carried by ferry-type transport devices to a storage device, synchronized with the web platform in the Intranet, and browsed by users in the Intranet. During information gathering in the Extranet and data synchronization in the Intranet, repeated gathering and copying must be avoided by comparing the information fingerprints extracted from the web pages. The prototype uses HashTrie to store the information fingerprints. In testing, the HashTrie-based structure is 2.28 times faster than Darts (double-array Trie), which is the fastest structure in the existing applied patents. Twelve existing high-speed hash functions that can serve HashTrie are also implemented. When the dictionary contains more than 5×10^5 words, the PJWHash or SuperFastHash function can be adopted; when the dictionary contains 10^5 words, the CalcStrCR32 and ELFHash functions can be adopted.
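
The duplicate-elimination step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how fingerprints are extracted or how HashTrie is laid out internally, so everything below is an assumption made for illustration: `fingerprint` (an MD5 digest of normalized text), `FingerprintTrie` (a plain trie whose nodes keep their children in hash maps), and `gather` are hypothetical names and are not the authors' implementation.

```python
import hashlib


def fingerprint(page_text):
    """Hypothetical fingerprint: MD5 of lowercased, whitespace-normalized text.

    The paper's real extraction method is not given in the abstract.
    """
    normalized = " ".join(page_text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class FingerprintTrie:
    """Illustrative trie over fingerprint characters.

    Each node is a dict (a hash map) from character to child node.
    This is an assumed stand-in, not the HashTrie from the paper.
    """

    def __init__(self):
        self._root = {}
        self._end = object()  # sentinel key marking a complete fingerprint

    def add(self, fp):
        """Insert fp; return True if it was new, False if already stored."""
        node = self._root
        for ch in fp:
            node = node.setdefault(ch, {})
        if self._end in node:
            return False
        node[self._end] = True
        return True


def gather(pages):
    """Keep only pages whose fingerprint has not been seen before."""
    seen = FingerprintTrie()
    return [page for page in pages if seen.add(fingerprint(page))]
```

In this sketch, calling `gather` on a list of page texts returns the pages in their original order with later copies of an already-seen fingerprint dropped. The same check could in principle be applied both when gathering in the Extranet and when synchronizing in the Intranet.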

Memo

Biography: Shen Yang (1974-), male, doctor, associate professor, 124739259@qq.com.
Foundation item: The National Basic Research Program of China (973 Program) (No. 2007CB310806).
Citation: Shen Yang, Zhu Chanyuan, Li Shuchen. System of twice-gathering information and research of information fingerprint HashTrie [J]. Journal of Southeast University (English Edition), 2008, 24 (3): 381-384.
Last Update: 2008-09-20