I need to grab some links that depend on the cookies sent with a GET request. So when I crawl the page with crawler4j, I need to send cookies along with the request to get the correct page back.
Is this possible? (I searched the web for it but didn't find anything useful.) Or is there a Java crawler out there that is capable of doing this?
Any help appreciated.
It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p-crawler4j-
There are several alternatives. I would say that Nutch and Heritrix are the best ones, with special emphasis on Nutch, because it's probably one of the only crawlers designed to scale well and actually perform a large crawl.
Coming late to this thread, but crawler4j actually does a good job of handling cookies. You can even inspect cookie values, because you can get hold of the underlying (Apache) HTTP client. For example:
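Here is a minimal sketch of the cookie-inspection idea using the Apache HttpClient 4.x API directly, which is the client crawler4j builds on. The exact hook for sharing the cookie store with crawler4j's PageFetcher varies by version, so treat the wiring as an assumption; example.com and the JSESSIONID value are placeholders:

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.cookie.Cookie;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.cookie.BasicClientCookie;

    public class CookieInspectionSketch {
        public static void main(String[] args) throws Exception {
            // Shared cookie store: HttpClient records Set-Cookie headers here
            BasicCookieStore cookieStore = new BasicCookieStore();

            // Pre-seed a cookie so the very first request already carries it
            BasicClientCookie session = new BasicClientCookie("JSESSIONID", "abc123");
            session.setDomain("example.com");   // hypothetical target site
            session.setPath("/");
            cookieStore.addCookie(session);

            try (CloseableHttpClient client = HttpClients.custom()
                    .setDefaultCookieStore(cookieStore)
                    .build();
                 CloseableHttpResponse response =
                     client.execute(new HttpGet("http://example.com/"))) {
                // Inspect whatever cookies the server set or updated
                for (Cookie c : cookieStore.getCookies()) {
                    System.out.println(c.getName() + " = " + c.getValue());
                }
            }
        }
    }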
I looked briefly at Nutch, but crawler4j seemed simpler to integrate (five minutes using a Maven dependency) and was perfect for my needs (I was testing that a session cookie is maintained on my site across a large number of requests). A minimal setup is sketched below.
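For reference, here is a minimal crawler4j setup sketch along those lines. It is based on the 4.x API (the shouldVisit signature differs in older versions), and the seed URL, storage folder, and crawler class are all placeholders:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class SessionCookieCrawl {
        // Visits every page under the seed; a real test would assert on the
        // session cookie after each fetch
        public static class MyCrawler extends WebCrawler {
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                return url.getURL().startsWith("http://example.com/"); // hypothetical site
            }

            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl data

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

            CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://example.com/");
            controller.start(MyCrawler.class, 1);         // single crawler thread
        }
    }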