Sending cookies in request with crawler4j?

Published: 2019-07-03 16:51

Question:

I need to grab some links whose presence depends on the cookies sent with a GET request. So when I crawl the page with crawler4j, I need to send some cookies along with the request to get the correct page back.

Is this possible? I searched the web for it but didn't find anything useful. Or is there a Java crawler out there that is capable of doing this?

Any help appreciated.

Answer 1:

It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p-crawler4j-

There are several alternatives:

  • Nutch
  • Heritrix
  • WebSPHINX
  • JSpider
  • WebEater
  • WebLech
  • Arachnid
  • JoBo
  • Web-Harvest
  • Ex-Crawler
  • Bixo

I would say that Nutch and Heritrix are the best ones, with special emphasis on Nutch: it is probably one of the only crawlers designed to scale well and actually perform a large crawl.



Answer 2:

Coming late to this thread, but crawler4j actually does a good job of handling cookies. You can even inspect cookie values, because you can get hold of the underlying Apache HTTP client. For example:

import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.DefaultHttpClient;

// Inside a WebCrawler subclass:
@Override
public void visit(Page page) {
    super.visit(page);

    // The controller exposes the PageFetcher, which wraps an Apache HTTP client.
    DefaultHttpClient httpClient =
            (DefaultHttpClient) getMyController().getPageFetcher().getHttpClient();

    // Walk the client's cookie store to inspect the cookies it is tracking.
    for (Cookie cookie : httpClient.getCookieStore().getCookies()) {
        if (cookie.getName().equals("somename")) {
            String value = cookie.getValue();
            // ... use the cookie value as needed
        }
    }
}

I looked briefly at Nutch, but crawler4j seemed simpler to integrate (five minutes via a Maven dependency) and was perfect for my needs (I was testing that a session cookie is maintained on my site across a large number of requests).
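
For the original question of sending cookies, the same handle on the underlying client can be used in reverse: seed its cookie store before starting the crawl. Below is a minimal, untested sketch; it assumes the DefaultHttpClient cast from the snippet above works in your crawler4j version, and seedCookie, the cookie name/value, and the domain are all placeholders, not crawler4j API.

import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.cookie.BasicClientCookie;

import edu.uci.ics.crawler4j.crawler.CrawlController;

// Hypothetical helper: call this after building the CrawlController but
// before controller.start(...), so requests carry the cookie from the start.
static void seedCookie(CrawlController controller) {
    // Same cast as in the visit() snippet above; assumed to hold in your version.
    DefaultHttpClient httpClient =
            (DefaultHttpClient) controller.getPageFetcher().getHttpClient();

    // "somename"/"somevalue" stand in for the session cookie your site expects.
    BasicClientCookie cookie = new BasicClientCookie("somename", "somevalue");
    cookie.setDomain("example.com"); // must match the host you are crawling
    cookie.setPath("/");
    httpClient.getCookieStore().addCookie(cookie);
}

Since crawler4j's threads share one PageFetcher (and thus one HTTP client), a cookie added to that store should then be sent on every matching request of the crawl.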