I'm running a MapReduce job and my inputs are gzipped, but they do not have a .gz file-name extension.
Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper. However, without the extension it doesn't do so. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them, even though they do not have the .gz extension.
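For what it's worth, the files really are plain gzip; a quick local check of the magic bytes (and of reading them through Python's gzip module) confirms it. This is just a stand-alone sanity check, and the path in the example is hypothetical:

    import gzip

    def looks_like_gzip(path):
        # gzip streams start with the magic bytes 1f 8b
        with open(path, "rb") as f:
            return f.read(2) == b"\x1f\x8b"

    def first_line(path):
        # reading locally through gzip.open works fine, so the data itself is OK
        with gzip.open(path, "rb") as f:
            return f.readline()

    # e.g. looks_like_gzip("input/part-00000") returns True for my files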
I tried passing the following flags to Hadoop:
    step_args = [
        "-jobconf", "stream.recordreader.compression=gzip",
        "-jobconf", "mapred.output.compress=true",
        "-jobconf", "mapred.output.compression.type=block",
        "-jobconf", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
    ]
However, the input reaching the mapper is still compressed. I verified this by printing the mapper's input from inside the mapper code:
mapper input: ^_^@%r?T^B??\K??6^R?+F?3^D??b?^R,??!???a?^X?A??n?m?k?3id?o?z[?-?L2yt^P$n?T,^V????^??y^O^R?nno>}^B^E^N-7?^Z?'?I?OF4??-^Z^X4;????f?RH???^Z?Q??4#^W?I?^F??^]?f+???f0d??A??v?A3*????7?x?p??7?Mq?.g??{^FL?g?^Y+?6??I????^V?C??I??$??ESCVd)K??}?Z??j?,3?{ ?}v???j???^??"?.??^L?^?LX^F??p???
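For reference, this is roughly the check I used. The real job is wired up through the streaming framework, so this is only a simplified, stand-alone sketch of the mapper's echo:

    import sys

    # Simplified streaming-style mapper: echo each raw input line so I can see
    # whether Hadoop has already decompressed it (it clearly has not).
    for line in sys.stdin:
        sys.stderr.write("mapper input: %r\n" % line)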
Any advice on how to get Hadoop to unzip these files on the fly would be greatly appreciated!
Thanks! Gil.