I know that the "getmerge" shell command can do this work.
But what should I do if I want to merge these outputs after the job using the HDFS Java API?
What I actually want is a single merged file on HDFS.
The only thing I can think of is to start an additional job after it.
Thanks!
But what should I do if I want to merge these outputs after the job using the HDFS Java API?
Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments: FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead call Path.getFileSystem on the destination path so that the output stream is opened on HDFS.

That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.
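For completeness, a minimal, untested sketch of that idea (the class name and paths are made up): it merges a job's output directory into a single file on the same HDFS by passing the same FileSystem as both the source and destination argument.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeOnHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical paths: the job's output directory and the merged target file.
            Path srcDir = new Path("/user/me/job-output");
            Path dstFile = new Path("/user/me/merged-output");

            // Source and destination both live on HDFS here, so the same
            // FileSystem serves as srcFS and dstFS.
            FileSystem fs = srcDir.getFileSystem(conf);

            // Concatenate every file under srcDir into dstFile. "false"
            // keeps the source directory; the final String argument is an
            // optional separator appended after each file (null for none).
            FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, null);
        }
    }

One caveat worth checking for your version: FileUtil.copyMerge was deprecated and, as far as I know, removed in Hadoop 3.x, so on newer clusters you would have to replicate its loop yourself (open each part file and copy it into a single output stream).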
You get a single output file by setting a single reducer in your code:

    job.setNumReduceTasks(1);

This will work for your requirement, but it is costly.
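A minimal sketch of a driver doing this (the class and job name are made up, and the rest of the job setup is omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SingleReducerDriver {
        public static Job configure() throws IOException {
            Job job = Job.getInstance(new Configuration(), "single-output-job");
            // One reduce task means exactly one part file in the output
            // directory, but also no reduce-side parallelism -- which is
            // the costly part on large inputs.
            job.setNumReduceTasks(1);
            // Mapper, reducer, and input/output paths would be configured
            // here as in any normal driver.
            return job;
        }
    }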
OR
you can use org.apache.hadoop.util.Shell.execCommand(String[]), a static method to execute a shell command. It covers most of the simple cases without requiring the user to implement the Shell interface.

Parameters:
cmd - the shell command to execute (an overload also accepts an env map of environment key=value pairs).
Returns:
the output of the executed command.
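As a rough, untested sketch (paths made up), you could shell out to -getmerge and then -put the merged file back onto HDFS:

    import java.io.IOException;

    import org.apache.hadoop.util.Shell;

    public class GetmergeViaShell {
        public static void main(String[] args) throws IOException {
            // -getmerge writes the merged file to the LOCAL filesystem,
            // so a second -put is needed to land it back on HDFS.
            String out = Shell.execCommand(
                    "hadoop", "fs", "-getmerge",
                    "/user/me/job-output", "/tmp/merged-output");
            System.out.println(out);

            Shell.execCommand(
                    "hadoop", "fs", "-put",
                    "/tmp/merged-output", "/user/me/merged-output");
        }
    }

Note that this only works if the hadoop binary is on the PATH of the user running the JVM.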