I'd like to learn:

- How does crawler4j work?
- Does it fetch a web page, then download and extract its content?
- What about the .db and .csv files and their structures?

Generally, what sequence does it follow? Please, I'd like a descriptive answer.
Thanks
General Crawler Process
The process for a typical multi-threaded crawler is as follows (a minimal sketch in code follows this list):

1. We have a queue data structure, which is called the `frontier`. Newly discovered URLs (or start points, the so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it was previously visited.
2. Crawler threads then obtain URLs from the `frontier` and schedule them for later processing.
3. The actual processing starts:
   - The `robots.txt` for the given URL is determined and parsed to honour exclusion criteria and be a polite web-crawler (configurable).
   - The page itself is fetched and its content downloaded; if it is HTML, it is parsed and newly discovered URLs are extracted and added to the `frontier` (in crawler4j this can be controlled via `shouldVisit(...)`).
4. The whole process is repeated until no new URLs are added to the `frontier`.
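To make that sequence concrete, here is a minimal sketch of the loop in plain Java (`java.util.concurrent` plus `java.net.http`), not crawler4j's internals. The regex-based link extractor and the idle-timeout shutdown are deliberate simplifications for illustration; a real crawler uses an HTML parser and proper in-flight accounting.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {

    // The frontier: a thread-safe queue of URLs waiting to be processed.
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();

    // Stands in for the "unique ID per URL" bookkeeping: tracks what was seen.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    private final HttpClient http = HttpClient.newHttpClient();

    // Toy link extractor; a real crawler parses the HTML properly.
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public void crawl(String seed, int threads) throws InterruptedException {
        offer(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(this::workLoop);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    // Adds a URL to the frontier only if it was never seen before
    // (this is also where a shouldVisit-style filter would be applied).
    private void offer(String url) {
        if (seen.add(url)) {
            frontier.add(url);
        }
    }

    private void workLoop() {
        try {
            String url;
            // Stop once the frontier stays empty for a while; crawler4j does
            // proper accounting of in-flight work instead of this heuristic.
            while ((url = frontier.poll(5, TimeUnit.SECONDS)) != null) {
                process(url);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(String url) {
        // robots.txt and politeness checks (step 3 above) would go here.
        try {
            HttpResponse<String> response = http.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // Extract outgoing links and feed them back into the frontier.
            Matcher m = LINK.matcher(response.body());
            while (m.find()) {
                offer(m.group(1));
            }
        } catch (Exception e) {
            // A real crawler would log the failure and possibly retry.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new MiniCrawler().crawl("https://example.com/", 4);
    }
}
```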
General (Focused) Crawler Architecture

Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

[architecture diagram omitted]

Disclaimer: The image is my own work. Please respect this by referencing this post.
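For completeness, here is roughly what driving crawler4j itself looks like, based on the 4.x API (older versions used `shouldVisit(WebURL url)` without the referring page). The seed URL, the URL prefix in the filter, and the storage folder are placeholders. Note the `crawlStorageFolder`: as far as I know, this is where crawler4j keeps its intermediate state, and the .db files you asked about are its frontier and doc-ID stores there, backed by Berkeley DB Java Edition.

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern BINARIES =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|pdf)$");

    // Decides whether an extracted URL may enter the frontier.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !BINARIES.matcher(href).matches()
                && href.startsWith("https://example.com/");
    }

    // Called once a page has been fetched and parsed.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL()
                    + " -> " + html.getOutgoingUrls().size() + " links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Intermediate data lives here, including the .db files
        // (frontier and doc-ID store, Berkeley DB JE).
        config.setCrawlStorageFolder("/tmp/crawler4j");
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://example.com/");
        controller.start(MyCrawler.class, 4); // 4 crawler threads
    }
}
```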