I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.
Both of them are capable of retrieving the content of a page and extracting sub-parts of it. What I don't understand is the difference between them. There is a similar question, which is marked as answered:
Crawler4j is a crawler, Jsoup is a parser.
But I just checked: Jsoup 1.8.3 is also capable of crawling a page in addition to its parsing functionality, while Crawler4j is capable not only of crawling a page but also of parsing its content.
So, can you please clarify the difference between Crawler4j and Jsoup?
Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, then there is no real benefit from using something like Crawler4J.
Let's take a look at an example. Assume you want to crawl a website. The requirements would be:
About page has a link for the Home page, but we already got the contents of Home page so don't visit it again).
Home page.
This is a simple scenario. Try solving this with Jsoup. All this functionality must be implemented by you. Crawler4J, or any crawler micro framework for that matter, would or should have an implementation for the actions above.
Jsoup's strong qualities shine when you get to decide what to do with the content.
Let's take a look at some requirements for parsing.
HTML specs)
This is where Jsoup comes into play. Of course, there is some overlap here. Some things might be possible with both Crawler4J and Jsoup, but that doesn't make them equivalent. You could remove the content-retrieval mechanism from Jsoup and it would still be an amazing tool to use. If Crawler4J removed the retrieval, it would lose half of its functionality.
I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4J, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup in order to extract the information I needed. Could I have done without one or the other? Yes, I could, but I would have had to implement all the missing functionality.
Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you could extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that gives you a simple API for HTTP requests. For anything more complex there is no implementation.
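To make the first example more concrete, here is a minimal sketch of the kind of bookkeeping you would have to write yourself if you crawled with a fetch-and-parse tool only. It uses plain Java and no Jsoup or Crawler4J calls. The class name `MiniCrawler`, the `crawl` method and the `linkExtractor` parameter are hypothetical; `linkExtractor` stands in for "fetch the page and return the links found on it". Not visiting a page twice comes from the example above. The same-host filter and the page cap are my own assumptions about typical crawl requirements, and multithreading, politeness delays and error handling are left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class MiniCrawler {

    /**
     * Breadth-first crawl starting at seed, returning pages in visit order.
     * linkExtractor is a hypothetical stand-in for "fetch url, return its links".
     */
    static List<String> crawl(String seed, int maxPages,
                              Function<String, List<String>> linkExtractor) {
        String host = URI.create(seed).getHost();
        Set<String> seen = new HashSet<>();        // prevents circular crawling
        Deque<String> frontier = new ArrayDeque<>();
        List<String> visitOrder = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);

        // Assumption: stop once maxPages pages have been visited.
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.poll();
            visitOrder.add(url);
            for (String link : linkExtractor.apply(url)) {
                // Assumption: only follow links that stay on the seed's host.
                boolean sameSite = host != null
                        && host.equals(URI.create(link).getHost());
                // Set.add returns false if the link was already queued or visited.
                if (sameSite && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        // In-memory "site": Home links to About and to an external page,
        // and About links back to Home.
        Map<String, List<String>> site = Map.of(
                "http://example.com/",
                List.of("http://example.com/about", "http://other.org/"),
                "http://example.com/about",
                List.of("http://example.com/"));

        List<String> pages = crawl("http://example.com/", 50,
                url -> site.getOrDefault(url, List.of()));
        System.out.println(pages);
        // prints [http://example.com/, http://example.com/about]
    }
}
```

In a real project the `linkExtractor` would be where a parser such as Jsoup does its work, while a crawler framework would replace everything else in this class, which is the split described above.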