I run a hadoop streaming job like this:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=16 \
    -Dmapred.output.compres=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input foo \
    -output bar \
    -mapper "python zot.py" \
    -reducer /bin/cat
I do get 16 files in the output directory which contain the correct data, but the files are not compressed:
$ hadoop fs -get bar/part-00012
$ file part-00012
part-00012: ASCII text, with very long lines
- Why is part-00012 not compressed?
- How do I get my data set split into a small number (say, 16) of gzip-compressed files?
PS. See also "Using gzip as a reducer produces corrupt data"
PPS. This is for vw.
PPPS. I guess I could do hadoop fs -get, gzip, hadoop fs -put, and hadoop fs -rm 16 times, but that seems like a very non-hadoopic way to do it.
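For reference, that manual fallback could be scripted; a rough sketch, assuming the 16 reducer outputs are named part-00000 through part-00015 under bar/ (untested, requires a working hadoop client):

```shell
# Hypothetical manual workaround: pull each reducer output down,
# gzip it locally, and push the compressed copy back.
for i in $(seq -w 0 15); do
    f="part-000${i}"
    hadoop fs -get "bar/${f}" "${f}"   # copy to local disk
    gzip "${f}"                        # produces ${f}.gz, removes ${f}
    hadoop fs -put "${f}.gz" "bar/${f}.gz"
    hadoop fs -rm "bar/${f}"           # drop the uncompressed original
done
```

It works, but every byte makes a round trip through the client machine, which is exactly the non-hadoopic part.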
There is a typo in your mapred.output.compres parameter: the property is actually mapred.output.compress, so your misspelled version is silently ignored. If you look through your job history, I'll bet compression shows as turned off.
Also, you could avoid the reduce stage altogether, since it is just catting files. Unless you specifically need exactly 16 part files, try leaving the job map-only.
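With the property name corrected (note the double "s"), the same invocation should produce gzip-compressed part files; a sketch keeping your original paths and scripts:

```shell
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=16 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input foo \
    -output bar \
    -mapper "python zot.py" \
    -reducer /bin/cat
```

Two caveats: on newer Hadoop releases the non-deprecated name is mapreduce.output.fileoutputformat.compress, though the old mapred.output.compress is still honored; and if you go map-only instead (-Dmapred.reduce.tasks=0, dropping the -reducer option), the number of output files is the number of map tasks (input splits), not 16.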