I was under the impression that combiners are just like reducers that act on the output of the local map task. That is, a combiner aggregates the results of an individual map task in order to reduce the network bandwidth needed for output transfer.
From reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems correct.
From chapter 2 (page 34)
Combiner Functions: Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
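To make the "zero, one, or many times" requirement concrete, here is a minimal plain-Java sketch (no Hadoop dependency). The class and method names are made up for illustration, and it assumes a word-count style sum used as both combiner and reducer. It applies the combiner 0 to 3 times and checks what the reducer would produce:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Hadoop code: a sum works as both combiner and reducer.
class CombinerInvariance {
    // Sums the counts per word, as the word count reducer does.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            totals.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return totals;
    }

    // Applies the combiner `times` times to the map output and returns the reducer's input.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput, int times) {
        List<Map.Entry<String, Integer>> current = mapOutput;
        for (int i = 0; i < times; i++) {
            current = new ArrayList<>(sum(current).entrySet());
        }
        return current;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("hello", 1), Map.entry("world", 1),
            Map.entry("hello", 1), Map.entry("world", 1), Map.entry("mapreduce", 1));
        for (int times = 0; times <= 3; times++) {
            List<Map.Entry<String, Integer>> reducerInput = combine(mapOutput, times);
            System.out.println(times + " combiner pass(es): " + reducerInput.size()
                + " records shuffled, reducer output " + sum(reducerInput));
        }
    }
}
```

Only the number of shuffled records changes between runs; the reducer output stays the same. This works because summing is associative and commutative. A function such as an average would not survive being applied a variable number of times.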
So I tried the following on the wordcount problem:
job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);
job.setNumReduceTasks(0);
Here are the counters:
14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient: File System Counters
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient: Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient: Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient: Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient: Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient: Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient: Total committed heap usage (bytes)=85000192
and here is part-m-00000:
hello 1
world 1
Hadoop 1
programming 1
mapreduce 1
wordcount 1
lets 1
see 1
if 1
this 1
works 1
12345678 1
hello 1
world 1
mapreduce 1
wordcount 1
So clearly no combiner is applied. I understand that Hadoop does not guarantee that a combiner will be called at all. But when I turn on the reduce phase, the combiner gets called.
Why does it behave this way?
Now, when I read chapter 6 (page 208) on how MapReduce works, I see this paragraph in the description of the reduce side:
The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
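The merge trigger described in that paragraph can be sketched in plain Java as follows. This is a simplified model, not Hadoop's actual implementation. The class, method, and parameter names are hypothetical, and the numbers in main are illustrative values rather than settings read from a cluster:

```java
// Hypothetical model of the reduce-side in-memory merge trigger described in the quote.
class ShuffleMergeTrigger {
    // True when the in-memory buffer should be merged and spilled to disk:
    // either the size threshold or the map-output-count threshold has been reached.
    static boolean shouldMerge(long bytesBuffered, long bufferCapacity, double mergePercent,
                               int mapOutputsBuffered, int inMemMergeThreshold) {
        boolean sizeReached = bytesBuffered >= (long) (bufferCapacity * mergePercent);
        boolean countReached = mapOutputsBuffered >= inMemMergeThreshold;
        return sizeReached || countReached;
    }

    public static void main(String[] args) {
        long heapBytes = 200L * 1024 * 1024;          // hypothetical reduce task heap
        long bufferBytes = (long) (heapBytes * 0.70); // stands in for mapred.job.shuffle.input.buffer.percent
        double mergePercent = 0.66;                   // stands in for mapred.job.shuffle.merge.percent
        int inMemMergeThreshold = 1000;               // stands in for mapred.inmem.merge.threshold

        // A few small map outputs: neither threshold is reached, so no merge yet.
        System.out.println(shouldMerge(1_000_000L, bufferBytes, mergePercent, 5, inMemMergeThreshold));
        // The buffer is nearly full: the size threshold triggers the merge,
        // and a configured combiner would run during it.
        System.out.println(shouldMerge(bufferBytes - 1, bufferBytes, mergePercent, 5, inMemMergeThreshold));
    }
}
```

Note that this path exists only when there is a reduce task to copy map outputs to, which is consistent with the combiner showing up once the reduce phase is turned on.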
My inferences from this paragraph are: 1) The combiner is ALSO run during the reduce phase.
A combiner will not run if it is a Map-Only job.
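On the map side, a combiner runs on each spill after the in-memory sort, and it runs again during the merge of spill files only if at least 3 spill files were written to disk (the min.num.spills.for.combine property).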
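As a rough sketch of the spill-count rule, assuming a threshold of 3 for min.num.spills.for.combine and counting spills in records rather than buffer bytes (a simplification; the names and the buffer capacity are hypothetical):

```java
// Hypothetical model of the map-side decision to run the combiner while merging spill files.
class SpillCombineSketch {
    // Simplification: a spill happens every `recordsPerSpill` records (Hadoop uses buffer bytes).
    static int spillCount(int mapOutputRecords, int recordsPerSpill) {
        return (mapOutputRecords + recordsPerSpill - 1) / recordsPerSpill; // ceiling division
    }

    // The combiner runs during the final merge only if one is set and enough spills exist.
    static boolean combinerRunsOnMerge(boolean combinerSet, int spills, int minSpillsForCombine) {
        return combinerSet && spills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        int minSpillsForCombine = 3; // assumed value of min.num.spills.for.combine
        int recordsPerSpill = 1000;  // hypothetical buffer capacity in records
        for (int records : new int[] {16, 2500, 10000}) {
            int spills = spillCount(records, recordsPerSpill);
            System.out.println(records + " records -> " + spills + " spill(s), combiner on merge: "
                + combinerRunsOnMerge(true, spills, minSpillsForCombine));
        }
    }
}
```

With only 16 map output records, as in the word count run above, there is at most one spill, so the merge-time combiner pass would be skipped even in a job with reducers.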
The main function of a combiner is optimization. In most cases it acts like a mini-reducer. Page 206 of the same book (the chapter on how MapReduce works, the map side) and the quote from your question both indicate that a combiner is run primarily for compactness. Reducing the network bandwidth for output transfer is an advantage of this optimization. The same book also says that Hadoop doesn't guarantee how many times a combiner is run (it could be zero).
A combiner is never run for map-only jobs. This makes sense because a combiner changes the map output, and since the number of times it is called is not guaranteed, the map output would not be guaranteed to be the same either.
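Here is a toy model of that difference in plain Java, using the words from the question. It is only a sketch. The names are hypothetical, and it assumes that with reducers configured the combiner happens to run exactly once, which Hadoop does not promise:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: a map-only job writes map output directly, with no sort or combine step.
class MapOnlyBypass {
    static List<String> runMapTask(List<String> words, int numReduceTasks) {
        List<String> output = new ArrayList<>();
        if (numReduceTasks == 0) {
            // Map-only: every (word, 1) pair goes straight to the part-m file.
            for (String w : words) {
                output.add(w + " 1");
            }
            return output;
        }
        // With reducers: pairs are buffered, sorted by key, and passed through the combiner.
        Map<String, Integer> combined = new TreeMap<>();
        for (String w : words) {
            combined.merge(w, 1, Integer::sum);
        }
        combined.forEach((w, n) -> output.add(w + " " + n));
        return output;
    }

    public static void main(String[] args) {
        List<String> words = List.of("hello", "world", "Hadoop", "programming", "mapreduce",
            "wordcount", "lets", "see", "if", "this", "works", "12345678",
            "hello", "world", "mapreduce", "wordcount");
        System.out.println("map-only records: " + runMapTask(words, 0).size());
        System.out.println("records after combining: " + runMapTask(words, 1).size());
    }
}
```

The map-only branch reproduces the 16 uncombined records seen in part-m-00000, while the other branch collapses the repeated words before anything would be shuffled.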