I am new to Hadoop and am performing some tests on my local machine.
There are many suggested solutions for dealing with a large number of small files. I am using a CombinedInputFormat class that extends CombineFileInputFormat.
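For reference, here is a minimal sketch of the kind of subclass I mean (the reader wrapper and class names below are illustrative, not my exact code):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // One CombineFileRecordReader handles the whole combined split and
            // delegates each underlying file to a line-based reader.
            return new CombineFileRecordReader<>(
                    (CombineFileSplit) split, context, TextReaderWrapper.class);
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Each small file is read in full by a single reader.
            return false;
        }

        // Wrapper exposing the (CombineFileSplit, TaskAttemptContext, Integer)
        // constructor that CombineFileRecordReader requires.
        public static class TextReaderWrapper
                extends CombineFileRecordReaderWrapper<LongWritable, Text> {
            public TextReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                    Integer idx) throws IOException, InterruptedException {
                super(new TextInputFormat(), split, context, idx);
            }
        }
    }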
I see that the number of mappers has dropped from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain now that the number of mappers has been reduced?
I ran the map-reduce job on the small files without CombinedInputFormat: 100 mappers took 10 minutes.
But when the same job was executed with CombinedInputFormat: 25 mappers took 33 minutes.
Any help would be appreciated.
Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "number" means ranging into the thousands.)
That means that if you have 1,000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1,000 map tasks, and each of these map tasks requires a certain amount of time to start and finish. This latency in task creation can reduce the performance of the job. In a multi-tenant cluster with resource limitations, getting a large number of map slots will also be difficult.
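For example, here is a minimal driver sketch (assuming Hadoop 2.x, where CombineTextInputFormat is available; the 128 MB maximum split size is just an illustrative value) showing how a job can be switched away from the default TextInputFormat so that many small files are packed into a few splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files-example");
            job.setJarByClass(SmallFilesDriver.class);

            // Pack many small files into a few large splits instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Illustrative cap: build combined splits of at most 128 MB.
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

            // Mapper/reducer classes and output key/value types go here as usual.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }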
Please refer to this link for more details and benchmark results.