Where do combiners combine mapper outputs - in m

Posted 2019-02-14 00:19

I was under the impression that combiners are just like reducers that act on the local map task. That is, a combiner aggregates the results of an individual map task in order to reduce the network bandwidth needed for output transfer.

And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems correct.

From chapter 2 (page 34)

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
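To make the last sentence of that passage concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of a word-count style sum that is used both as combiner and reducer. The class and method names (`CombinerInvariance`, `sumByKey`, `run`, `sampleMapOutput`) are made up for illustration and are not Hadoop APIs; the point is only that applying the combiner zero, one, or two times leaves the reducer's result unchanged.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerInvariance {
    // Sums the counts per word. In word count the same logic serves as
    // both the combiner and the reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            totals.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return totals;
    }

    // Applies the combiner `combinerPasses` times, then runs the reducer once.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> mapOutput, int combinerPasses) {
        List<Map.Entry<String, Integer>> records = mapOutput;
        for (int i = 0; i < combinerPasses; i++) {
            records = new ArrayList<>(sumByKey(records).entrySet());
        }
        return sumByKey(records);
    }

    // Hypothetical map output: one (word, 1) record per word occurrence.
    static List<Map.Entry<String, Integer>> sampleMapOutput() {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : new String[] {"hello", "world", "mapreduce", "hello", "world"}) {
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (int passes = 0; passes <= 2; passes++) {
            System.out.println(passes + " combiner pass(es): " + run(sampleMapOutput(), passes));
        }
    }
}
```

This only works because summation is associative and commutative; a function such as a mean could not be reused as a combiner this way.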

So I tried the following on the wordcount problem:

job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);
job.setNumReduceTasks(0);

Here are the counters:

14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient:   File System Counters
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient:   Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient:     Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient:     Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient:     Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient:     Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=85000192

And here is part-m-00000:

hello   1
world   1
Hadoop  1
programming 1
mapreduce   1
wordcount   1
lets    1
see 1
if  1
this    1
works   1
12345678    1
hello   1
world   1
mapreduce   1
wordcount   1

So clearly no combiner is applied. I understand that Hadoop does not guarantee that a combiner will be called at all. But when I turn on the reduce phase, the combiner gets called.

Why does it behave this way?

Now, when I read chapter 6 (page 208) on how MapReduce works, I see this paragraph in the description of the reduce side:

The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

My inferences from this paragraph are: 1) The combiner is ALSO run during the reduce phase.

2 answers
萌系小妹纸
#2 · 2019-02-14 00:32
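  1. A combiner will not run in a map-only job.

  2. When reducers are enabled, the combiner runs as map output is spilled to disk. During the final merge of the spill files it is run again only if there are at least 3 spill files (the default of min.num.spills.for.combine).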
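The two rules can be sketched as a small decision function. This is a simplified model, not Hadoop's actual code: `CombinerDecision`, `runsOnSpill` and `runsOnMerge` are hypothetical names, and the threshold assumes the "at least three spill files" rule that The Definitive Guide gives for min.num.spills.for.combine.

```java
class CombinerDecision {
    // Assumed default of min.num.spills.for.combine, as described in the book.
    static final int MIN_SPILLS_FOR_COMBINE = 3;

    // Does the combiner run when the map-side buffer is spilled to disk?
    static boolean runsOnSpill(boolean combinerSet, int numReduceTasks) {
        // With zero reducers the map output is written directly as job output,
        // so there is no sort/spill step for a combiner to hook into.
        return combinerSet && numReduceTasks > 0;
    }

    // Does the combiner run again while the spill files are merged?
    static boolean runsOnMerge(boolean combinerSet, int numReduceTasks, int numSpills) {
        return runsOnSpill(combinerSet, numReduceTasks)
                && numSpills >= MIN_SPILLS_FOR_COMBINE;
    }

    public static void main(String[] args) {
        System.out.println("map-only job, on spill:       " + runsOnSpill(true, 0));
        System.out.println("1 reducer, on spill:          " + runsOnSpill(true, 1));
        System.out.println("1 reducer, merge of 2 spills: " + runsOnMerge(true, 1, 2));
        System.out.println("1 reducer, merge of 3 spills: " + runsOnMerge(true, 1, 3));
    }
}
```

This is consistent with the counters in the question: with job.setNumReduceTasks(0) the job reports Spilled Records=0, so the combiner never gets a chance to run.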

贼婆χ
#3 · 2019-02-14 00:33

The main function of a combiner is optimization. In most cases it acts like a mini-reducer. From page 206 of the same book, in the chapter "How MapReduce Works" (the map side):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

The quote from your question,

If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

Both quotes indicate that a combiner is run primarily to make the data more compact. Reduced network bandwidth for output transfer is one benefit of this optimization.

Also, from the same book,

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

This means that Hadoop doesn't guarantee how many times a combiner is run (it could be zero).

A combiner is never run for map-only jobs. This makes sense because a combiner changes the map output, and since the number of times it is called is not guaranteed, the map output would not be guaranteed to be the same either.
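A rough illustration of that last point, as a plain-Java sketch with no Hadoop dependency: if the combiner is applied per spill, the records a map task emits depend on where the spill boundaries fall. `SpillBoundaryDemo` and `combinePerSpill` are hypothetical names, and measuring a spill in records is a simplification (Hadoop spills based on buffer usage).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SpillBoundaryDemo {
    // Cuts the map output into spills of `spillSize` records, sums the counts
    // per word within each spill, and concatenates the combined spills.
    static List<String> combinePerSpill(List<String> words, int spillSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start += spillSize) {
            int end = Math.min(start + spillSize, words.size());
            Map<String, Integer> totals = new TreeMap<>();
            for (String w : words.subList(start, end)) {
                totals.merge(w, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : totals.entrySet()) {
                out.add(e.getKey() + "\t" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "world", "hello", "world");
        // Same input, different spill boundaries, different map-side output.
        System.out.println("spill size 2: " + combinePerSpill(words, 2));
        System.out.println("spill size 4: " + combinePerSpill(words, 4));
    }
}
```

A reducer would sum both variants to the same totals, but as the final output of a map-only job they would differ, which is one way to see why the combiner is not applied there.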
