I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.
This file contains the names of all the other files to be processed by the mapper function.
But I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?
Any ideas?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat doesn't read files recursively from the input path directory.
Solution: Use the following code
FileInputFormat.setInputDirRecursive(job, true);
before this line in your MapReduce code:
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here to see in which version it was fixed.
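For context, a minimal driver sketch showing where those two calls go; the class name, job name, and output types are placeholders rather than code from the original post, and the default identity mapper/reducer are used so the snippet stays self-contained.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process folder1");
        job.setJarByClass(Driver.class);
        // set your own Mapper/Reducer classes here; the output types below match the default identity Mapper
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // enable recursive traversal of the input directory BEFORE adding the path
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /folder1
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}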
You could use FileSystem.listStatus to get the file list from the given directory and add each file to the job explicitly.
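A rough sketch of that approach (not the original answer's code); it assumes a configured Job named job and the directory path in args[0]:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// list everything under the input directory and add each file as an input path
FileSystem fs = FileSystem.get(job.getConfiguration());
for (FileStatus status : fs.listStatus(new Path(args[0]))) { // e.g. /folder1
    if (!status.isDirectory()) {
        FileInputFormat.addInputPath(job, status.getPath());
    }
}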
Use the MultipleInputs class.
Have a look at the working code.
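A minimal sketch of how MultipleInputs could be used here (MyMapper is a placeholder for your own mapper class, and the file names follow the question's example):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// add each file in /folder1 as its own input; they can all share the same mapper
MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"), TextInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"), TextInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, new Path("/folder1/file3.txt"), TextInputFormat.class, MyMapper.class);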
You can use HDFS wildcards to provide multiple files,
so the solution is to pass a glob pattern as the input path, as sketched below.
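For example (the jar and output names follow the question's command; the exact glob is an assumption):

hadoop jar ABC.jar /folder1/* /output

or, equivalently, inside the driver, since FileInputFormat expands glob patterns:

FileInputFormat.addInputPath(job, new Path("/folder1/*"));

On the command line it is safer to quote the pattern (e.g. '/folder1/*') so the local shell doesn't expand it before Hadoop sees it.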