Can someone please point out where I could find an implementation for CombineFileInputFormat
(org. using Hadoop 0.20.205? This is to create large splits from very small log files (lines of text) using EMR.
It is surprising that Hadoop does not ship a default implementation of this class, which was made specifically for this purpose, and from googling it looks like I'm not the only one confused by this. I need to compile the class and bundle it in a jar for hadoop-streaming; with my limited knowledge of Java, this is quite a challenge.
Edit: I already tried the yetitrails example with the necessary imports, but I get a compiler error for the next method.
Here is an implementation I have for you:
In your job, first set the parameter
mapred.max.split.size
to the size (in bytes) that you would like the input files to be combined into. Do something like the following in your run():