I have many files in HDFS, each of them a zip file with a single CSV file inside. I'm trying to uncompress the files so I can run a streaming job on them.
I tried:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-mapper /bin/zcat -reducer /bin/cat \
-input /path/to/files/ \
-output /path/to/output
However, I get an error: "subprocess failed with code 1".
I also tried running it on a single file and got the same error.
Any advice?
After some experimenting, I discovered that if you make this modification to hadoop streaming (sketched below), you get all your gzipped files uncompressed into a new directory. The file names are all lost (renamed to the typical part-XXXX names), but this worked for me.
I speculate this works because Hadoop automatically uncompresses gzipped files under the hood, and cat is just echoing that already-decompressed output.
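For reference, a minimal sketch of what that modified job might look like, assuming the only change is to replace the zcat mapper with /bin/cat and let Hadoop's built-in gzip codec do the decompression (the paths are the placeholders from the question):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-mapper /bin/cat \
-input /path/to/files/ \
-output /path/to/output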
Hadoop can read files compressed in the gzip format, but that's different from the zip format. Hadoop cannot read zip files AFAIK.
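If the files really are zip archives, one workaround (with hypothetical file names) is to pull each archive out of HDFS, extract it locally, and push the CSV back:

hdfs dfs -get /path/to/files/archive.zip .
unzip archive.zip               # extracts the single CSV, e.g. archive.csv
hdfs dfs -put archive.csv /path/to/csv/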
The root cause of the problem is that you get many lines of textual info from Hadoop before the actual data arrives.
For example, hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | zcat | wc -l will NOT work either; it fails with the error message "gzip: stdin: not in gzip format".
Therefore you have to skip these "unnecessary" info lines. In my case I had to skip 86 lines.
So my one-line command (for counting the records) becomes: hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -n+86 | zcat | wc -l
Note: this is a workaround (not a real solution) and it is very ugly because of the hard-coded "86", but it works fine :)
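If you want to avoid hard-coding the 86, one possible (equally ugly) variant is to locate the gzip magic bytes (1f 8b) yourself and skip by byte offset instead of by line count; this is only a sketch and assumes bash, GNU grep, and the C locale:

OFFSET=$(hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | LC_ALL=C grep -aobm1 $'\x1f\x8b' | cut -d: -f1)   # byte offset of the first gzip header
hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -c +$((OFFSET + 1)) | zcat | wc -l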
A simple way to unzip / uncompress a file within HDFS, for whatever reason you need it:
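For example, assuming the file is gzip-compressed (the paths here are hypothetical), you can stream it out of HDFS, decompress it on the client, and write the result straight back; hdfs dfs -put with "-" as the source reads from stdin:

hdfs dfs -cat /path/to/data.csv.gz | gzip -d | hdfs dfs -put - /path/to/data.csv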