I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.
I followed https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as the seed.
Now I can't find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format?
Versions
- apache-nutch-1.9
- solr-4.10.4
After your crawl is over, you can use the bin/nutch dump command to dump all the URLs fetched, in plain HTML format.
The usage is as follows:
$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
[-segment <segment>]
-h,--help show this help message
-mimetype <mimetype> an optional list of mimetypes to dump, excluding
all others. Defaults to all.
-outputDir <outputDir> output directory (which will be created) to host
the raw data
-segment <segment> the segment(s) to use
So, for example, you could do something like:
$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
This creates a new directory at the -outputDir location and dumps all the crawled pages in HTML format.
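Once the dump exists, reading it back as plain text is straightforward. Here is a minimal Python sketch using only the standard library; the crawl/dump path and the *.html glob are assumptions based on the example command above, so adjust them to whatever your dump actually contains.

```python
# Sketch: strip tags from the dumped HTML files to get plain text.
# Assumes the dump was written to crawl/dump/ as in the example above.
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

    def text(self):
        return " ".join(self.chunks)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

# For the real dump you would iterate over the files, e.g.:
#   for path in Path("crawl/dump").rglob("*.html"):  # glob is an assumption
#       print(html_to_text(path.read_text(errors="ignore")))
# Demo on an inline snippet:
sample = "<html><body><h1>Title</h1><p>Some crawled text.</p></body></html>"
print(html_to_text(sample))  # -> Title Some crawled text.
```

For anything heavier (boilerplate removal, encoding detection), a dedicated extractor library would do better, but this is enough for basic analysis over a few hundred pages.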
There are many more ways of dumping specific data out of Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions