What sequence of steps does crawler4j follow to fetch web pages?

Posted 2019-07-25 02:28

I'd like to learn:

  1. How does crawler4j work?
  2. Does it fetch a web page, then download its content and extract it?
  3. What about the .db and .csv files and their structures?

Generally, what sequence does it follow?

Please, I'd like a descriptive answer.

Thanks

1 answer

虎瘦雄心在 · 2019-07-25 03:25

General Crawler Process

The process for a typical multi-threaded crawler is as follows:

  1. We have a queue data structure, which is called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether a given URL was previously visited.

  2. Crawler threads then obtain URLs from the frontier and schedule them for later processing.

  3. The actual processing starts:

    • The robots.txt for the given URL is determined and parsed to honour exclusion criteria and be a polite web crawler (configurable)
    • Next, the thread checks the politeness policy, i.e. how long to wait before visiting the same host again.
    • The actual URL is visited by the crawler and the content is downloaded (this can be literally anything, not just HTML)
    • If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
  4. The whole process is repeated until no new URLs are added to the frontier.
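The loop above can be sketched in plain Java. This is a single-threaded simplification for illustration: the "web" is an in-memory map from URL to outgoing links instead of real HTTP fetching, and the robots.txt and politeness steps are left out:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {

    // Breadth-first crawl over an in-memory "web"; returns URLs in visit order.
    static List<String> crawl(Map<String, List<String>> web, String seed) {
        Queue<String> frontier = new ArrayDeque<>(); // step 1: the frontier
        Set<String> seen = new HashSet<>();          // unique-URL bookkeeping
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {                // step 4: until the frontier is empty
            String url = frontier.poll();            // step 2: obtain the next URL
            visited.add(url);                        // step 3: "fetch" the page ...
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // ... extract links, skip duplicates
                    frontier.add(link);              // and feed new ones back to the frontier
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://a.example", List.of("http://b.example", "http://c.example"),
            "http://b.example", List.of("http://c.example"),
            "http://c.example", List.of("http://a.example"));
        System.out.println(crawl(web, "http://a.example"));
        // prints [http://a.example, http://b.example, http://c.example]
    }
}
```

Note that the cycle c → a does not cause an infinite loop: the `seen` set is exactly the "was this URL previously visited" check from step 1.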

General (Focused) Crawler Architecture

Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

(Image: basic crawler architecture)

Disclaimer: Image is my own work. Please respect this by referencing this post.
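For crawler4j specifically, the hooks mentioned above look roughly like the following sketch, based on crawler4j's public API. The seed URL, domain filter, thread count, and storage folder are illustrative; the crawl storage folder is where crawler4j keeps its frontier database on disk (its Berkeley DB-backed state), which is likely where the database files from the question come from:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Step 3, last bullet: decide which discovered URLs enter the frontier.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().startsWith("https://example.com/"); // illustrative filter
    }

    // Called once a page has been fetched and parsed.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> "
                    + html.getOutgoingUrls().size() + " outgoing links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Frontier and "already seen" state are persisted here.
        config.setCrawlStorageFolder("/tmp/crawl-root");
        config.setPolitenessDelay(1000); // politeness: delay between requests to one host

        PageFetcher pageFetcher = new PageFetcher(config);
        // robots.txt handling (the first bullet of step 3) is wired in here.
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/"); // step 1: seed the frontier
        controller.start(MyCrawler.class, 2);       // step 2: two crawler threads
    }
}
```

Running this requires the crawler4j dependency on the classpath; the structure (config, page fetcher, robots.txt server, controller, crawler class) mirrors the architecture described above.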
