Crawler in Groovy (JSoup VS Crawler4j)

Published 2019-03-28 11:52

Question:

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website and build a list of the site's URLs along with their resource types, their content, their response times, and the number of redirects involved.

I am debating between JSoup and Crawler4j. I have read about what each basically does, but I cannot clearly understand the difference between the two. Can anyone suggest which would be the better fit for the functionality above? Or is it simply incorrect to compare the two?

Thanks.

Answer 1:

Crawler4j is a crawler; Jsoup is a parser. They solve different problems, so you could (and arguably should) use both. Crawler4j provides an easy-to-use, multithreaded interface for fetching all the URLs and pages (content) of the site you want. You can then use Jsoup to parse that content with its excellent jQuery-like CSS selectors and actually do something with it.

Of course, you also have to consider dynamic (JavaScript-generated) content. If you want that content too, you need something that includes a JavaScript engine (a headless browser plus a parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before you parse the content.
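To make the division of labor concrete, here is a minimal, untested sketch in Groovy of the combination described above: Crawler4j fetches the pages, and Jsoup parses each page's HTML. It assumes the crawler4j and jsoup jars are on the classpath, and `https://example.com/`, the storage folder, and the thread count are placeholder values to replace with your own.

```groovy
import edu.uci.ics.crawler4j.crawler.CrawlConfig
import edu.uci.ics.crawler4j.crawler.CrawlController
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {

    // Only follow links that stay on the seed domain (adjust as needed).
    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        url.getURL().toLowerCase().startsWith('https://example.com/')
    }

    // Crawler4j has already fetched the page; hand its HTML to Jsoup.
    @Override
    void visit(Page page) {
        def parseData = page.parseData
        if (parseData instanceof HtmlParseData) {
            def doc = Jsoup.parse(parseData.html, page.webURL.getURL())
            // jQuery-like CSS selectors: pull out whatever you need.
            println "${page.webURL.getURL()} -> title: ${doc.title()}, " +
                    "links: ${doc.select('a[href]').size()}"
        }
    }
}

// Wire up the controller and start the crawl.
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl')
def fetcher = new PageFetcher(config)
def robots = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robots)
controller.addSeed('https://example.com/')
controller.start(SiteCrawler, 4)   // 4 crawler threads
```

From inside `visit(Page)` you could also record the response metadata the question asks about (content type, redirects, timing) into MongoDB via your Grails domain classes, since Crawler4j exposes the fetched `Page` and its headers there.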