Hadoop: Provide a directory as input to a MapReduce job

Published 2019-01-23 19:27

I'm using Cloudera Hadoop. I can run a simple MapReduce program where I provide a file as input.

This file contains the names of all the other files to be processed by the mapper function.

But, I'm stuck at one point.

/folder1
  - file1.txt
  - file2.txt
  - file3.txt

How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?

Any ideas?

EDIT:

1) Initially, I provided inputFile.txt as input to the MapReduce program, and it worked perfectly.

>inputFile.txt
file1.txt
file2.txt
file3.txt

2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line:

hadoop jar ABC.jar /folder1 /output

4 Answers
成全新的幸福 · 2019-01-23 19:48

The problem is that FileInputFormat doesn't read files recursively from the input path directory.

Solution: in your MapReduce driver code, add the following line

FileInputFormat.setInputDirRecursive(job, true);

before this line:

FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here for which version it was fixed.
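
For context, a minimal driver sketch showing where the call goes (the Driver class name and the surrounding job wiring are assumptions for illustration, not from the original answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process folder1");
        job.setJarByClass(Driver.class);
        // set mapper, reducer, and output key/value classes here as usual

        // make FileInputFormat descend into subdirectories of the input path
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}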

萌系小妹纸 · 2019-01-23 19:59

You could use FileSystem.listStatus to get the list of files in the given directory; the code could look like this:

// get the FileSystem; 'job' is the org.apache.hadoop.mapreduce.Job you configured
FileSystem fs = FileSystem.get(job.getConfiguration());
// get the FileStatus list for the given directory
FileStatus[] statusList = fs.listStatus(new Path(args[0]));
if (statusList != null) {
    for (FileStatus status : statusList) {
        // add each file as an input path for the map-reduce job
        FileInputFormat.addInputPath(job, status.getPath());
    }
}
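
Note that listStatus only returns the immediate children of the directory. If /folder1 ever contains nested subdirectories, a sketch using FileSystem.listFiles with the recursive flag (an addition to the original answer, using org.apache.hadoop.fs.RemoteIterator and LocatedFileStatus) would pick up every file:

// recursively enumerate every file under the input directory
RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path(args[0]), true);
while (files.hasNext()) {
    // add each file as an input path for the map-reduce job
    FileInputFormat.addInputPath(job, files.next().getPath());
}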
forever°为你锁心 · 2019-01-23 20:02

Use the MultipleInputs class:

MultipleInputs.addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass,
        Class<? extends Mapper> mapperClass)
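
A usage sketch in the driver (FirstMapper and SecondMapper are hypothetical mapper classes, not part of the original answer):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// each input path gets its own InputFormat and Mapper
MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"),
        TextInputFormat.class, FirstMapper.class);  // hypothetical mapper
MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"),
        TextInputFormat.class, SecondMapper.class); // hypothetical mapper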

Have a look at the working code.

我欲成王,谁敢阻挡 · 2019-01-23 20:11

You can use HDFS wildcards to provide multiple files.

So, the solution:

hadoop jar ABC.jar /folder1/* /output

or

hadoop jar ABC.jar /folder1/*.txt /output
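
One caveat (an addition, not from the original answer): the local shell may expand an unquoted * against the local filesystem before Hadoop ever sees it, so it is safer to quote the pattern and let FileInputFormat expand the glob against HDFS:

hadoop jar ABC.jar '/folder1/*.txt' /output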