I followed https://wiki.apache.org/nutch/HttpAuthenticationSchemes to crawl a few websites that require a username and password.
What I have done:
i) I set the auth-configuration in the httpclient-auth.xml file:
<auth-configuration>
<credentials username="xyz" password="xyz">
<default realm="domain" />
<authscope host="www.gmail.com" port="80"/>
</credentials>
</auth-configuration>
ii) I defined the plugin.includes property, with protocol-httpclient enabled, in both nutch-site.xml and nutch-default.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
iii) I also defined the auth configuration file in nutch-site.xml:
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.
</description>
</property>
I'm not able to crawl the site, and I get no error.
Requirement: I want to crawl websites such as gmail.com or yahoomail.com, or any other site that asks for authentication.
Where am I going wrong? Am I choosing the wrong websites to crawl?
(If yes, please suggest websites that ask for authentication and I'll register for them.)
(If no, how can I crawl my Gmail or Facebook accounts?)
A few points that should help you resolve this issue:
1) Yes, you have chosen the wrong websites to crawl and index. Try different websites.
2) Nutch supports only NTLM, Basic, or Digest authentication. It does not support form-based authentication. The sites you are trying to crawl use form-based authentication.
3) To implement form-based authentication, you will have to customize the Nutch code.
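To tell which case from point 2 applies to a given site, you can look at how the server responds to an unauthenticated request. A site using NTLM, Basic, or Digest answers with status 401 and a WWW-Authenticate header naming the scheme; a site with a login form typically returns an ordinary HTML page or a redirect instead. The following is only a rough sketch in Python, not part of Nutch: the names classify_auth and probe and the returned labels are made up for illustration, and only the first WWW-Authenticate header is inspected.

```python
import urllib.error
import urllib.request
from typing import Mapping

# Schemes that point 2 above lists as supported by Nutch.
SUPPORTED_SCHEMES = {"basic", "digest", "ntlm"}


def classify_auth(status: int, headers: Mapping[str, str]) -> str:
    """Classify a response by its HTTP authentication challenge (hypothetical helper)."""
    # Header names are case-insensitive, so compare them in lower case.
    challenge = next(
        (v for k, v in headers.items() if k.lower() == "www-authenticate"), ""
    )
    parts = challenge.split()
    if status == 401 and parts:
        scheme = parts[0].lower()
        if scheme in SUPPORTED_SCHEMES:
            return "http-auth:" + scheme
        return "http-auth-unsupported:" + scheme
    # No 401 challenge: the site probably uses a login form (or needs no login).
    return "no-http-challenge"


def probe(url: str) -> str:
    """Fetch url without credentials and classify the response (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify_auth(resp.status, resp.headers)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 401; the error still carries code and headers.
        return classify_auth(err.code, err.headers)
```

Calling probe on the protected URL you want to crawl and getting "no-http-challenge" back suggests that the credentials in httpclient-auth.xml will never be requested by the server, which would be consistent with a crawl that fetches nothing and reports no error.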
The following two links should help you make progress on this issue:
http://technical-fundas.blogspot.in/2014/05/nutch-solr-formed-based-authentication.html
http://technical-fundas.blogspot.in/2014/06/how-to-configure-nutch-in-eclipse-for.html