I followed the guide at https://wiki.apache.org/nutch/HttpAuthenticationSchemes to crawl a few websites that require a username and password.
Workaround: i) I have set the auth configuration in the httpclient-auth.xml file:
<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default realm="domain" />
    <authscope host="www.gmail.com" port="80"/>
  </credentials>
</auth-configuration>
ii) I defined the plugin.includes property, with protocol-httpclient enabled, in both nutch-site.xml and nutch-default.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
iii) I have also defined the auth configuration file in nutch-site.xml:
<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
  <description>Authentication configuration file for 'protocol-httpclient' plugin.
  </description>
</property>
I'm not able to crawl the sites, and I get no error.
Requirement: I want to crawl websites such as gmail.com or yahoomail.com, or any other site that asks for authentication.
Where am I going wrong? Am I choosing the wrong websites for crawling?
(If yes, please suggest websites that ask for authentication, and I'll register for them.)
(If no, how can I crawl my Gmail or Facebook accounts?)