I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.
I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (integrating Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as the seed.
Now I can't find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?
Versions
- apache-nutch-1.9
- solr-4.10.4
Once your crawl is over, you can use the bin/nutch dump command to dump all the fetched URLs as plain HTML files.
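The usage is roughly as follows (check the command's own usage output for the exact options in your version):
    bin/nutch dump -segment <segment> -outputDir <outputDir>
So, for example, assuming your crawl data lives under crawl/, you could do something like:
    bin/nutch dump -segment crawl/segments -outputDir crawl/dump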
This creates a new directory at the -outputDir location and dumps all the crawled pages there in HTML format.
There are many more ways of dumping specific data out of Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions
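Since the question also asks about reading the data as text, here is a minimal sketch of one way to turn the dumped HTML files into plain text. It is not part of Nutch: the names TextExtractor, html_to_text and read_dump are hypothetical helpers, it uses only the Python standard library, and it assumes the dumped files are HTML that decodes reasonably as UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def read_dump(dump_dir):
    """Map each file under dump_dir (assumed to be a dump directory) to its text."""
    root = Path(dump_dir)
    pages = {}
    if not root.is_dir():
        return pages
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumption: UTF-8; undecodable bytes are replaced, not fatal.
            raw = path.read_bytes().decode("utf-8", errors="replace")
            pages[str(path)] = html_to_text(raw)
    return pages
```

For example, read_dump("crawl/dump") would return a dict from file path to extracted text, where "crawl/dump" stands for whatever directory you passed as -outputDir. Non-HTML files (images, PDFs) in the dump would come out as noise, so you may want to filter by file extension first.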