Web Crawler Architecture
- Marc Najork
in Encyclopedia of Database Systems
Published by Springer-Verlag, 2009
ISBN: 978-0-387-39940-9
A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. They are also used in many other applications that process large numbers of web pages, such as web data mining and comparison-shopping engines. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers.
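The seed/download/extract/enqueue cycle described above can be illustrated with a short sketch. The following Python code is not part of the original entry and makes simplifying assumptions (single-threaded, no politeness delays, no robots.txt handling, standard-library networking and HTML parsing only); all identifiers are chosen here for illustration.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags encountered in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)   # URLs discovered but not yet downloaded
        seen = set(seed_urls)         # avoids downloading the same URL twice
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue              # skip pages that fail to download
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)   # resolve relative links against the page URL
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return pages

    if __name__ == "__main__":
        corpus = crawl(["https://example.com/"], max_pages=10)
        print(len(corpus), "pages downloaded")

A production crawler replaces each of these simplified components: the in-memory frontier becomes a distributed, prioritized queue, the single download loop becomes thousands of concurrent fetchers spread across many machines, and the seen-set becomes a scalable duplicate-URL test, which is where the engineering challenges mentioned above arise.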
Copyright © 2009 Springer. All rights reserved. This entry has been published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, edited by Ling Liu and M. Tamer Özsu, is a multi-volume, comprehensive, and authoritative reference on databases, data management, and database systems. It is available in both print and online formats; the online version, accessible on SpringerLink, offers advanced search functionality and interlinking with related online content. Visit http://www.springer.com/computer/database+management++information+retrieval/book/978-0-387-49616-0 for more information about the Encyclopedia of Database Systems.