Hadoop job taking input files from multiple directories

Posted 2019-01-23 18:19


I have a situation where I have many files (100+, 2-3 MB each) in compressed gz format spread across multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz

I have to feed all these files into one Map job. From what I see, to use MultiFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper, not Pig or Hadoop streaming.

Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit

1 Answer
爱情/是我丢掉的垃圾
#2 · 2019-01-23 18:53

FileInputFormat.addInputPaths() can take a comma-separated list of multiple paths (files or directories), like

FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz");
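For context, here is a minimal driver sketch showing how the comma-separated paths fit into a job setup. The class name MultiDirJob, the placeholder Mapper logic, and the output path are my own assumptions for illustration, not part of the original answer; the directory names follow the question's example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiDirJob {

    // Placeholder Mapper: emits each input line with a count of 1.
    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multi-dir-input");
        job.setJarByClass(MultiDirJob.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Comma-separated list of input directories; the part-NNNN.gz files
        // inside each one are picked up. Gzip is decompressed transparently,
        // but it is not a splittable format, so each .gz file is processed
        // by exactly one mapper.
        FileInputFormat.addInputPaths(job, "A1/B1/C1,A2/B2/C2");

        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Alternatively, FileInputFormat.addInputPath(job, new Path(...)) can be called once per directory in a loop, and input path strings accept glob patterns, so something like "A*/B*/C*" could cover all the directories with a single call.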