I am new to Hadoop and am performing some tests on my local machine.
There are many suggested solutions for dealing with a large number of small files. I am using a CombinedInputFormat class that extends CombineFileInputFormat.
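For reference, here is a minimal sketch of the kind of subclass I mean (the reader wrapper and class names below are illustrative, not my exact code):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // One CombineFileRecordReader handles the whole combined split and
            // delegates each underlying file to a line-based reader.
            return new CombineFileRecordReader<>(
                    (CombineFileSplit) split, context, TextReaderWrapper.class);
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Each small file is read in full by a single reader.
            return false;
        }

        // Wrapper exposing the (CombineFileSplit, TaskAttemptContext, Integer)
        // constructor that CombineFileRecordReader requires.
        public static class TextReaderWrapper
                extends CombineFileRecordReaderWrapper<LongWritable, Text> {
            public TextReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                    Integer idx) throws IOException, InterruptedException {
                super(new TextInputFormat(), split, context, idx);
            }
        }
    }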
I see that the number of mappers has dropped from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain now that the number of mappers has been reduced?
I ran the map-reduce job on the small files without CombinedInputFormat: 100 mappers took 10 minutes.
But when the same job was executed with CombinedInputFormat: 25 mappers took 33 minutes.
Any help would be appreciated.
Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "number" means ranging into the thousands.)
That means that if you have 1,000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1,000 map tasks, and each of these map tasks requires a certain amount of time to start and finish. This latency in task creation can reduce the performance of the job. In a multi-tenant cluster with resource limitations, getting a large number of map slots will also be difficult.
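For example, here is a minimal driver sketch (assuming Hadoop 2.x, where CombineTextInputFormat is available; the 128 MB maximum split size is just an illustrative value) showing how a job can be switched away from the default TextInputFormat so that many small files are packed into a few splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files-example");
            job.setJarByClass(SmallFilesDriver.class);

            // Pack many small files into a few large splits instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Illustrative cap: build combined splits of at most 128 MB.
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

            // Mapper/reducer classes and output key/value types go here as usual.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }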
Please refer to this link for more details and benchmark results.