USAGE OF DEDICATED DATA STRUCTURES FOR URL DATABASES IN A LARGE-SCALE CRAWLING
DOI:
https://doi.org/10.7494/csci.2009.10.3.7
Keywords:
crawling, crawler, large-scale, Berkeley DB, URL database, URL repository, data structures
Abstract
The article discusses the use of Berkeley DB data structures, such as hash tables and B-trees, for the implementation of a high-performance URL database. It presents a formal model of a data-structure-oriented URL database, which can be used as an alternative to a relational-oriented URL database.