I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARC file reader. When I invoke my code by hand like this:
cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb
it works as expected. Under Hadoop streaming, however, the parser breaks.
It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to the mapper. While doing so, however, it converts \r\n line breaks in the stream to \n. Since ARC relies on a record length given in the header line, this change breaks the parser (because the data length no longer matches).
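To illustrate why the conversion matters, here is a minimal Ruby sketch. The record body is made up for the example and is not taken from a real ARC file; it only shows that a byte length recorded before the conversion no longer matches the data afterwards.

```ruby
# Hypothetical record body (an assumption for illustration, not real crawl data).
body = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>\r\n"

# The length an ARC-style header line would declare for this body.
declared_length = body.bytesize

# Simulate the \r\n to \n conversion that happens on the way to the mapper.
converted = body.gsub("\r\n", "\n")

# One byte is lost per line break, so the declared length overshoots.
lost = declared_length - converted.bytesize
puts "declared=#{declared_length} actual=#{converted.bytesize} lost=#{lost}"
```

A reader that trusts the declared length would read past the end of this record and into the next one, which is consistent with the parser failure described above.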
To double check, I changed my mapper to expect uncompressed data, and did:
cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb
And it works.
I don't mind Hadoop decompressing automatically (although I can quite happily deal with streaming .gz files myself), but if it does, I need it to decompress in 'binary' mode, without any line break conversion or similar. I believe the default behaviour is to feed decompressed files to one mapper per file, which is perfect.
How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.
All of this will eventually run on AWS ElasticMapReduce.
Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):
Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has already been stripped), and the PipeMapper writes it out to the streaming process with just a \n.
One option is to check whether this 'feature' still exists in your version's PipeMapper.java and, if so, patch the source as required (perhaps allowing the behaviour to be set via a configuration property).
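The effect described above can be imitated in a few lines of Ruby. This is only a simulation under assumptions, not Hadoop code: simulate_line_pipeline is a hypothetical helper, and the sample input is invented.

```ruby
# Simulation only: imitates a line-oriented reader followed by a writer
# that re-terminates every line with a bare "\n". Not actual Hadoop code.
def simulate_line_pipeline(input)
  # Step 1: the reader hands over each line without its terminator
  # (String#chomp removes a trailing "\r\n", "\n" or "\r").
  lines = input.each_line.map { |line| line.chomp }
  # Step 2: the writer appends "\n" to each line.
  lines.map { |line| line + "\n" }.join
end

# Invented sample input with \r\n line endings.
original = "header 11\r\nhello\r\nworld\r\n"
output = simulate_line_pipeline(original)

puts "bytes in:  #{original.bytesize}"
puts "bytes out: #{output.bytesize}"
```

Every \r\n in the input comes out as \n, so the output is shorter by one byte per line, which is exactly the kind of change that invalidates a length-prefixed format such as ARC.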