What sequence of steps does crawler4j follow to fetch web pages?

Posted 2019-07-25 02:28

I'd like to learn:

  1. How does crawler4j work?
  2. Does it fetch a web page, then download its content and extract it?
  3. What about the .db and .csv files and their structures?

Generally, what sequence does it follow?

Please, I'd like a descriptive answer.

Thanks

1 answer

虎瘦雄心在 · 2019-07-25 03:25

General Crawler Process

The process for a typical multi-threaded crawler is as follows:

  1. We have a queue data structure, which is called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether a given URL was previously visited.

  2. Crawler threads then obtain URLs from the frontier and schedule them for later processing.

  3. The actual processing starts:

    • The robots.txt for the given URL is determined and parsed to honour exclusion criteria and be a polite web crawler (configurable)
    • Next, the thread checks the politeness policy, i.e. how long to wait before visiting the same host again.
    • The actual URL is visited by the crawler and the content is downloaded (this can be literally anything, not just HTML)
    • If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
  4. The whole process is repeated until no new URLs are added to the frontier.
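The loop above can be sketched in plain Java. This is a single-threaded simplification for illustration: the "web" is an in-memory map from URL to outgoing links instead of real HTTP fetching, and the robots.txt and politeness steps are left out:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {

    // Breadth-first crawl over an in-memory "web"; returns URLs in visit order.
    static List<String> crawl(Map<String, List<String>> web, String seed) {
        Queue<String> frontier = new ArrayDeque<>(); // step 1: the frontier
        Set<String> seen = new HashSet<>();          // unique-URL bookkeeping
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {                // step 4: until the frontier is empty
            String url = frontier.poll();            // step 2: obtain the next URL
            visited.add(url);                        // step 3: "fetch" the page ...
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // ... extract links, skip duplicates
                    frontier.add(link);              // and feed new ones back to the frontier
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://a.example", List.of("http://b.example", "http://c.example"),
            "http://b.example", List.of("http://c.example"),
            "http://c.example", List.of("http://a.example"));
        System.out.println(crawl(web, "http://a.example"));
        // prints [http://a.example, http://b.example, http://c.example]
    }
}
```

Note that the cycle c → a does not cause an infinite loop: the `seen` set is exactly the "was this URL previously visited" check from step 1.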

General (Focused) Crawler Architecture

Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

(Image: basic crawler architecture)

Disclaimer: Image is my own work. Please respect this by referencing this post.
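For crawler4j specifically, the hooks mentioned above look roughly like the following sketch, based on crawler4j's public API. The seed URL, domain filter, thread count, and storage folder are illustrative; the crawl storage folder is where crawler4j keeps its frontier database on disk (its Berkeley DB-backed state), which is likely where the database files from the question come from:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Step 3, last bullet: decide which discovered URLs enter the frontier.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().startsWith("https://example.com/"); // illustrative filter
    }

    // Called once a page has been fetched and parsed.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> "
                    + html.getOutgoingUrls().size() + " outgoing links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Frontier and "already seen" state are persisted here.
        config.setCrawlStorageFolder("/tmp/crawl-root");
        config.setPolitenessDelay(1000); // politeness: delay between requests to one host

        PageFetcher pageFetcher = new PageFetcher(config);
        // robots.txt handling (the first bullet of step 3) is wired in here.
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/"); // step 1: seed the frontier
        controller.start(MyCrawler.class, 2);       // step 2: two crawler threads
    }
}
```

Running this requires the crawler4j dependency on the classpath; the structure (config, page fetcher, robots.txt server, controller, crawler class) mirrors the architecture described above.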
