USAGE OF DEDICATED DATA STRUCTURES FOR URL DATABASES IN A LARGE-SCALE CRAWLING

Authors

  • Krzysztof Dorosz, AGH University of Science and Technology

DOI:

https://doi.org/10.7494/csci.2009.10.3.7

Keywords:

crawling, crawler, large-scale, Berkeley DB, URL database, URL repository, data structures

Abstract

The article discusses the use of Berkeley DB data structures, such as hash tables and B-trees, for the implementation of a high-performance URL database. It presents a formal model for a data-structure-oriented URL database, which can be used as an alternative to a relational URL database.

Author Biography

Krzysztof Dorosz, AGH University of Science and Technology

Institute of Computer Science

Published

2013-03-20

How to Cite

Dorosz, K. (2013). USAGE OF DEDICATED DATA STRUCTURES FOR URL DATABASES IN A LARGE-SCALE CRAWLING. Computer Science, 10(3), 7. https://doi.org/10.7494/csci.2009.10.3.7

Section

Articles
