I'm trying to run a streaming job where the input files are CSVs inside zip files.
I tried using this, but it doesn't seem to work with CDH4. I get the error: class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat
Does anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (one that can be given the top-level directory).
I ended up writing zipstream.
Note that it processes only the first file in the zip; I'll probably add support for multiple files later.
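To illustrate what "only the first file in the zip" means in practice, here is a minimal Java sketch. It is not the zipstream code; the class name FirstZipEntry, the method readFirstEntryLines, and the UTF-8 assumption are all made up for this example. It relies only on java.util.zip.ZipInputStream from the JDK.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of zipstream or Hadoop.
class FirstZipEntry {

    // Returns the lines of the first non-directory entry in the zip stream.
    // Any later entries are ignored. Assumes the entry is UTF-8 text.
    static List<String> readFirstEntryLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry = zip.getNextEntry();
            // Skip directory entries so that "first file" means a real file.
            while (entry != null && entry.isDirectory()) {
                entry = zip.getNextEntry();
            }
            if (entry == null) {
                return lines; // empty archive, or directories only
            }
            // ZipInputStream reports end-of-stream at the end of the current
            // entry, so this reader stops after the first file.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zip, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FirstZipEntry.java some-archive.zip
        try (InputStream in = new FileInputStream(args[0])) {
            for (String line : readFirstEntryLines(in)) {
                System.out.println(line);
            }
        }
    }
}
```

Supporting multiple files would mean continuing to call getNextEntry() in a loop instead of stopping after the first entry.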
There are two Hadoop APIs for input formats: mapred.InputFormat and mapreduce.InputFormat.
mapreduce is the newer API and the one you should be using if you can.
I would check which InputFormat the ZipFileInputFormat actually implements. If it implements the mapreduce version, you'll need to move your job over to that newer API.
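One way to do that check is with a few lines of reflection. This is only a sketch: the class name InputFormatCheck and the method isSubtypeOf are invented for the example, and it assumes the Hadoop jars plus the jar containing the input format are on the classpath when you run it. Both types are looked up by name, so the file itself compiles without Hadoop.

```java
// Hypothetical helper for illustration only; not part of Hadoop.
class InputFormatCheck {

    // Returns true if the class named className can be assigned to the type
    // named typeName (i.e. it implements or extends it).
    static boolean isSubtypeOf(String className, String typeName) {
        try {
            ClassLoader loader = InputFormatCheck.class.getClassLoader();
            // 'false' avoids running static initializers of the loaded classes.
            Class<?> candidate = Class.forName(className, false, loader);
            Class<?> target = Class.forName(typeName, false, loader);
            return target.isAssignableFrom(candidate);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Class not on classpath: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the error message from the question.
        String name = args.length > 0 ? args[0] : "com.cotdp.hadoop.ZipFileInputFormat";
        System.out.println("old API (mapred):    "
                + isSubtypeOf(name, "org.apache.hadoop.mapred.InputFormat"));
        System.out.println("new API (mapreduce): "
                + isSubtypeOf(name, "org.apache.hadoop.mapreduce.InputFormat"));
    }
}
```

If the first line prints false, that would match the "not org.apache.hadoop.mapred.InputFormat" error above.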
For a bit of background: in an earlier Hadoop version, 'mapred' was deprecated in favor of 'mapreduce', a newer, faster, and cleaner implementation. Unfortunately, this new API didn't include all the features of the old one, so in more recent versions of Hadoop 'mapred' was reinstated, and now there are two APIs that basically do the same thing.