After a Nutch crawl in distributed (deploy) mode as follows:
bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20
I need to extract each URL fetched along with its content in a map-reduce friendly format. Using the readseg command below, the contents are dumped, but the output format doesn't lend itself to map-reduce processing.
bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext
Ideally the output should be in this format:
http://abc.com/1 content of http://abc.com/1
http://abc.com/2 content of http://abc.com/2
Any suggestions on how to achieve this?
The bin/nutch readseg command produces output in a human-readable format, not a map-reduce friendly one. The data is stored in the segments in map-reduce format, but I don't think you can pull that information out of the segments directly in the form you want. A few options for your concern: the output of the readseg command can be converted to map-reduce form by writing a small map-reduce job over it.
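Since the segment data is already in map-reduce (sequence file) format, another way to get at it is to read a segment's content files directly with Hadoop's SequenceFile API. Below is a minimal sketch, assuming Nutch 1.4's on-disk layout where fetched content sits under <segment>/content/part-NNNNN/data as <Text url, Content> records; the class name and argument handling are mine, not Nutch's:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // e.g. /crawl/segments/20120101000000/content/part-00000/data
    Path data = new Path(args[0]);
    FileSystem fs = data.getFileSystem(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      Content content = new Content();
      // Emit one "url <tab> content" line per fetched page.
      while (reader.next(url, content)) {
        System.out.println(url + "\t" + new String(content.getContent()));
      }
    } finally {
      reader.close();
    }
  }
}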
The answer lies in tweaking the source code of Nutch, and this turned out to be quite simple. Navigate to the SegmentReader.java file at apache-nutch-1.4-bin/src/java/org/apache/nutch/segment.
Inside the SegmentReader class is a reduce method, which is responsible for generating the human-readable output that the bin/nutch readseg command produces. Alter the StringBuffer dump variable as you see fit; it holds the entire output for a given URL, which is represented by the key variable.
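For illustration, here is a paraphrased sketch of what the tweaked reduce method could look like. This is not the verbatim Nutch 1.4 source: the key/values/output parameters and the NutchWritable unwrapping follow SegmentReader, the rest is my assumption, and the imports are already present in SegmentReader.java.

// Paraphrased sketch of SegmentReader.reduce after the tweak -- not the
// verbatim Nutch source. Emits "url<TAB>content" instead of the
// human-readable report.
public void reduce(Text key, Iterator<NutchWritable> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  StringBuffer dump = new StringBuffer();
  while (values.hasNext()) {
    Writable value = values.next().get(); // unwrap the NutchWritable
    if (value instanceof Content) {
      // Flatten newlines so the whole record stays on one line,
      // which keeps the dump map-reduce friendly.
      String content = new String(((Content) value).getContent());
      dump.append(content.replaceAll("[\\r\\n]+", " "));
    }
    // CrawlDatum, ParseData and ParseText values are ignored here,
    // matching the -noparse -noparsedata -noparsetext flags above.
  }
  // Put the URL into the value itself, in case the job's output format
  // writes only the value.
  output.collect(key, new Text(key.toString() + "\t" + dump.toString()));
}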
Make sure you run ant to create a new binary; further calls to bin/nutch readseg will then generate the output in your custom format.
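For example, from the Nutch source root (reusing the readseg command from the question):

ant
bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext

Each record in the dump files under /output should then come out as a single url<TAB>content line.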
These references were extremely useful in navigating the code:
[1] http://nutch.apache.org/apidocs-1.4/overview-summary.html
[2] http://nutch.apache.org/apidocs-1.3/index-all.html