Nutch: Authentication via putting a cookie in the

Posted 2019-05-27 04:24

Question:

I am surprised that there is so little support or information out there for getting Nutch to be able to crawl parts of a website that require authentication.

I am aware that Apache Nutch may not currently support HTTP POST authentication (though it apparently hopes to).

However, all we really want to do is add a cookie to the request headers our Nutch bot sends, so that it can access those parts of the site that way (rather than posting a username and password to a form and then receiving the cookie).

I have spent a good amount of time searching and am surprised that most discussions about this date all the way back to 2005 or 2008: here, there, everywhere.

After all these years, is there any way to work around this limitation, or is there still no way to authenticate by giving Nutch a 'prebaked' cookie so it can access members-only parts of our site?

Answer 1:

I added custom code to the Nutch protocol-httpclient plugin to solve this issue.

I have shared the changes at the link below:

http://www.gingercart.com/Home/search-and-crawl/nutch-custom-authentication-cookies-session-management-to-crawl-secure-enterprise-websites
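
For readers who only want to see the general idea of a 'prebaked' cookie, here is a minimal sketch in plain Java. It is not the code from the link above and it does not use any Nutch API: the class name `PrebakedCookie`, its two methods, and the cookie names and values are all hypothetical. It only shows how a cookie obtained elsewhere (for example, copied from a logged-in browser session) can be attached to an outgoing request as a `Cookie` header, which is roughly what a patched protocol plugin would have to do before fetching each page.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Nutch.
public class PrebakedCookie {

    // Joins name=value pairs into one Cookie request-header value,
    // e.g. "JSESSIONID=abc123; remember=1".
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Prepares (but does not yet send) a request that carries the cookies.
    // No network traffic happens until the caller reads the response.
    public static HttpURLConnection open(String url, Map<String, String> cookies)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Cookie", toCookieHeader(cookies));
        return conn;
    }

    public static void main(String[] args) {
        // Assumed example values; a real session cookie would come from
        // a browser or a separate login step.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("remember", "1");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

Keep in mind that a session cookie supplied this way expires like any other, so a long crawl would need some way to refresh it.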