I'm using Cloudera Hadoop. I can run a simple MapReduce program that takes a single file as input.
That file lists all the other files to be processed by the mapper function.
But I'm stuck at one point. Suppose I have a directory like this:
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it processes each file inside that directory?
Any ideas?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It worked perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line.
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat does not read files recursively from subdirectories of the input path by default.
Solution: enable recursive input with the following call
FileInputFormat.setInputDirRecursive(job, true);
placed before this line in your MapReduce driver code:
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here for which version it was fixed.
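To make the effect of the recursive flag concrete, here is a plain-Java analogy on the local filesystem. It is not Hadoop code, and the class and method names (ListingDemo, flat, recursive) are made up for illustration. A flat listing sees only the files directly inside the directory, while a recursive walk also picks up files in nested subdirectories. It only illustrates which files are visible, not how Hadoop itself reacts to a nested directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingDemo {
    // Regular files directly inside dir; subdirectories are not entered.
    static List<String> flat(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return relativeNames(dir, entries);
        }
    }

    // Regular files anywhere below dir, including nested subdirectories.
    static List<String> recursive(Path dir) throws IOException {
        try (Stream<Path> entries = Files.walk(dir)) {
            return relativeNames(dir, entries);
        }
    }

    // Keep regular files only and return their paths relative to dir, sorted.
    private static List<String> relativeNames(Path dir, Stream<Path> entries) {
        return entries.filter(Files::isRegularFile)
                .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: folder1/file1.txt and folder1/sub/file2.txt
        Path dir = Files.createTempDirectory("folder1");
        Files.createFile(dir.resolve("file1.txt"));
        Files.createDirectory(dir.resolve("sub"));
        Files.createFile(dir.resolve("sub").resolve("file2.txt"));

        System.out.println("flat:      " + flat(dir));
        System.out.println("recursive: " + recursive(dir));
    }
}
```

If all your files sit directly in /folder1, as in the question, the flat and recursive views are the same; the flag only matters once there are nested directories.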
You could use FileSystem.listStatus to get the list of files in the given directory and add each one as an input path. The code could look like this:
// Get the FileSystem; conf must be initialized properly.
// Here conf is assumed to be a JobConf (old mapred API); with the new API, pass the Job to addInputPath instead.
FileSystem fs = FileSystem.get(conf);
// Get the FileStatus list for the given directory.
FileStatus[] statusList = fs.listStatus(new Path(args[0]));
if (statusList != null) {
    for (FileStatus status : statusList) {
        // Add each entry to the list of inputs for the MapReduce job.
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
You can use HDFS wildcards (glob patterns) in the input path to match multiple files.
So the solution is:
hadoop jar ABC.jar /folder1/* /output
or
hadoop jar ABC.jar /folder1/*.txt /output
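As a rough illustration of what the two patterns above select, here is a small plain-Java sketch that applies the same glob patterns to a list of file names using the JDK's own glob matcher. This is an analogy only: Hadoop resolves globs with its own implementation, and the class name GlobDemo, the select helper, and the extra notes.log file are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class GlobDemo {
    // Return the names that match the given glob pattern (JDK glob syntax).
    static List<String> select(String glob, List<String> names) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return names.stream()
                .filter(name -> matcher.matches(Paths.get(name)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical directory contents; notes.log is added to show the difference.
        List<String> names = Arrays.asList("file1.txt", "file2.txt", "file3.txt", "notes.log");

        System.out.println("*     -> " + select("*", names));
        System.out.println("*.txt -> " + select("*.txt", names));
    }
}
```

The second pattern is the safer choice when the directory may also contain files you do not want the mappers to read.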
Use the MultipleInputs class, which lets you register each input path with its own InputFormat and Mapper:
MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
Have a look at working code