Hadoop: Does using CombineFileInputFormat for small files improve performance?

Published 2019-09-08 09:02

I am new to Hadoop and performing some tests on my local machine.

Many solutions have been proposed for dealing with large numbers of small files. I am using a custom CombinedInputFormat that extends CombineFileInputFormat.
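A minimal sketch of what such a class might look like, assuming the new `mapreduce` API and plain text input (the wrapper class name here is illustrative, not necessarily my exact code):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Packs many small files into each split, so one mapper reads several files.
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files inside the combined
        // split, creating one wrapped line reader per file.
        return new CombineFileRecordReader<>((CombineFileSplit) split, context,
                TextLineRecordReaderWrapper.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the files are tiny, so read each one whole
    }

    // Adapts TextInputFormat's per-file reader to the combined-split interface.
    public static class TextLineRecordReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextLineRecordReaderWrapper(CombineFileSplit split,
                TaskAttemptContext context, Integer idx)
                throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}
```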

With CombinedInputFormat, the number of mappers dropped from 100 to 25. Should I also expect a performance gain now that there are fewer mappers?

I ran the MapReduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes.

But when the same job was executed with CombinedInputFormat: 25 mappers took 33 minutes.
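For context, the job was wired up roughly like this (a sketch; the driver name and paths are placeholders, and note that if `mapreduce.input.fileinputformat.split.maxsize` is left unset, CombineFileInputFormat may pack far more data into each split than a mapper handled before):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap each combined split at 128 MB (one default HDFS block), so each
        // mapper receives roughly one block's worth of small files.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "small-files-job");
        job.setJarByClass(SmallFilesJobDriver.class);
        job.setInputFormatClass(CombinedInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper/reducer/output settings as in the original job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```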

Any help will be appreciated.

1 Answer
可以哭但决不认输i
#2 · 2019-09-08 09:46

Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "huge number" means ranging into the thousands.)

That means if you have 1,000 files of 1 MB each, a MapReduce job based on the normal TextInputFormat will create 1,000 map tasks, and each of these map tasks requires a certain amount of time to start up and tear down. This per-task latency can reduce the performance of the job: for instance, at roughly two seconds of startup and teardown per task, 1,000 tasks spend over 30 minutes on overhead alone.
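You can see this effect directly by comparing split counts on the same input directory; here is a sketch using the built-in CombineTextInputFormat (the input path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCountCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-count-check");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // One split (and hence one map task) per small file:
        System.out.println("TextInputFormat splits: "
                + new TextInputFormat().getSplits(job).size());
        // Many small files packed into each split:
        System.out.println("CombineTextInputFormat splits: "
                + new CombineTextInputFormat().getSplits(job).size());
    }
}
```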

In a multi-tenant cluster with resource limits, obtaining a large number of map slots will also be difficult.

Please refer to this link for more details and benchmark results.
