I am surprised that there is so little support or information out there for getting Nutch to be able to crawl parts of a website that require authentication.
I am aware that Apache Nutch may not currently support HTTP POST authentication (although it apparently hopes to).
However, all we really want is to add a cookie to the headers our Nutch bot sends, so that it can access those parts of the site directly, rather than posting a username and password to a form and receiving the cookie in response.
So I have spent a good amount of time searching and am surprised that most discussions about this are all the way back in 2005 or 2008: here, there, everywhere.
After all these years, is there any way to work around this limitation, or is there still no way to authenticate by giving Nutch a 'prebaked' cookie so it can access the members-only parts of our site?
I have added custom code to Nutch's protocol-httpclient plugin to solve the issue.
I have shared the changes at the link below:
http://www.gingercart.com/Home/search-and-crawl/nutch-custom-authentication-cookies-session-management-to-crawl-secure-enterprise-websites
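
For readers who only want the general idea of a 'prebaked' cookie, here is a minimal sketch in plain Java. It is not the code from the link above and it does not use Nutch's plugin API; it only shows, with the JDK's `HttpURLConnection`, how a known session cookie can be attached to a request as a `Cookie` header. The class name `PrebakedCookieFetch`, the cookie names `JSESSIONID` and `remember`, and their values are all hypothetical placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration class; not part of Nutch.
public class PrebakedCookieFetch {

    // Joins name/value pairs into one Cookie request header value,
    // separating the pairs with "; ".
    static String buildCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Prepares a GET request that carries the prebaked cookie.
    // Nothing is sent until the caller reads the response.
    static HttpURLConnection openWithCookie(URL url, String cookieHeader) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Cookie", cookieHeader);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Assumed cookie names and values; copy the real ones from a logged-in browser session.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("remember", "1");
        String header = buildCookieHeader(cookies);
        System.out.println("Cookie: " + header);

        // Only touch the network when a URL is passed on the command line.
        if (args.length > 0) {
            HttpURLConnection conn = openWithCookie(URI.create(args[0]).toURL(), header);
            System.out.println("HTTP status: " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}
```

Inside Nutch the same header would have to be set wherever the protocol plugin builds its request, which is why a plugin change like the one described above is needed. Keep in mind that a session cookie usually expires, so a long crawl may need a fresh value.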