I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt:
User-Agent: *
Disallow: /
Is there any way to crawl this website with Apache Nutch?
In nutch-site.xml, set the protocol.plugin.check.robots property to false.
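As a rough sketch, the override in conf/nutch-site.xml might look like the fragment below. It assumes your Nutch release still reads this property (not every version does), and the file location and the surrounding `<configuration>` element follow the usual Nutch layout rather than anything stated in the question.

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumption: this Nutch version honors the property -->
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
  </property>
</configuration>
```

If the file already contains a `<configuration>` element, add only the `<property>` block inside it.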
OR
You can comment out the code that performs the robots check. In Fetcher.java, lines 605-614 do the check (the exact line numbers vary between Nutch versions). Comment out that entire block:
    if (!rules.isAllowed(fit.u)) {
      // unblock
      fetchQueues.finishFetchItem(fit, true);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Denied by robots.txt: " + fit.url);
      }
      output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
      reporter.incrCounter("FetcherStatus", "robots_denied", 1);
      continue;
    }
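You can ignore robots.txt by setting a property to false in nutch-site.xml. Note that Protocol.CHECK_ROBOTS is the name of the Java constant; the property key it refers to, and the name to use in nutch-site.xml, is protocol.plugin.check.robots.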