Crawling websites which ask for authentication

Posted 2019-06-10 02:53

I followed https://wiki.apache.org/nutch/HttpAuthenticationSchemes to crawl a few websites by providing a username and password.

What I have done so far:

i) I set the auth configuration in the httpclient-auth.xml file:

<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default realm="domain" />
    <authscope host="www.gmail.com" port="80"/>
  </credentials>
</auth-configuration>

ii) I enabled protocol-httpclient in the plugin.includes property in both nutch-site.xml and nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

iii) I also defined the auth configuration file in nutch-site.xml:

<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
  <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
</property>

I am not able to crawl the site, and I get no error.

Requirement: I want to crawl websites such as gmail.com or yahoomail.com, or any other site that asks for authentication.

Where am I going wrong? Am I choosing the wrong websites to crawl?

(If yes, please suggest websites that ask for authentication, and I will register for them.)

(If no, how can I crawl my Gmail or Facebook accounts?)

1 answer
Bombasti
Reply #2 · 2019-06-10 03:40

A few points that should help you resolve this issue:

1) Yes, you have chosen the wrong websites to crawl and index. Try some different websites.

2) Nutch supports only NTLM, Basic, and Digest authentication. It does not support form-based authentication, and the sites you are trying to crawl use form-based authentication.

3) To implement form-based authentication, you will have to customize your Nutch code.
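
To check which kind of authentication a site uses before pointing Nutch at it, you can look at how the server answers an unauthenticated request. With NTLM, Basic, or Digest, the server replies with status 401 and a WWW-Authenticate header naming the scheme; a form-based site normally just returns an HTML login page with no such challenge. Below is a rough Python sketch of that check. The function name `classify_auth` is made up for illustration and is not part of Nutch, and it reads only the first scheme on each header line.

```python
def classify_auth(status, headers):
    """Return the HTTP auth schemes a response is challenging for.

    `status` is the HTTP status code and `headers` is a list of
    (name, value) pairs taken from the response.
    An empty list means no HTTP-level challenge was sent, which is
    what a form-based login page typically looks like.
    """
    if status != 401:
        # Only a 401 response carries an HTTP authentication challenge.
        return []
    schemes = []
    for name, value in headers:
        if name.lower() == "www-authenticate" and value.strip():
            # The scheme is the first token, e.g. 'Basic realm="domain"'.
            schemes.append(value.split()[0].lower())
    return schemes


# Example with hand-written responses (no network access needed):
basic_site = classify_auth(401, [("WWW-Authenticate", 'Basic realm="domain"')])
form_site = classify_auth(200, [("Content-Type", "text/html")])
```

If the result is empty for the page you want to crawl, credentials in httpclient-auth.xml will never be requested by the server, which would explain a crawl that fetches nothing useful without reporting an error.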

I am sure the following two links will help you make progress on this issue:

http://technical-fundas.blogspot.in/2014/05/nutch-solr-formed-based-authentication.html

http://technical-fundas.blogspot.in/2014/06/how-to-configure-nutch-in-eclipse-for.html
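
For orientation, here is roughly what a form-based login involves at the HTTP level, written as a small standalone Python sketch rather than as Nutch code. The client POSTs the login form fields, keeps the session cookie the server sets, and sends that cookie with every later request. The URL and the field names (`username`, `password`) are placeholders; a real site defines its own form fields and may add hidden tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Build the POST request that submits the login form fields."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


opener, jar = make_session()
# Placeholder URL and field names, for illustration only.
request = build_login_request(
    "https://example.com/login",
    {"username": "xyz", "password": "xyz"},
)
# Sending is left commented out because the URL above is not a real login page:
# opener.open(request)                              # server sets a session cookie
# opener.open("https://example.com/private/page")   # cookie is sent automatically
```

This is the behavior a customized Nutch protocol plugin would have to reproduce, which is why a static username and password in httpclient-auth.xml are not enough for such sites.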
