Who will get a chance to execute first, Combiner or Partitioner?

Posted 2019-05-15 06:39

Question:

I'm confused after reading the following passage in Hadoop: The Definitive Guide, 4th edition (page 204):

  • Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.

  • Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

  • Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Here are my questions:

1) Which runs first, the combiner or the partitioner?

2) When both a custom combiner and a custom partitioner are configured, in what order are the steps executed?

3) Can we feed compressed data (Avro, SequenceFile, etc.) to a custom combiner? If yes, how?

I'm looking for a brief but in-depth explanation.

Thanks in advance.

Answer 1:

1/ The answer is already in the passage you quoted: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

So the partitions are created in memory first. If there is a custom combiner, it then runs in memory on the sorted output of each partition, and the result is spilled to disk at the end.

2/ A custom combiner and a custom partitioner are used only when they are set in the driver class:
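
The order described above (partition, then sort, then combine, then spill) can be sketched in plain Java. This is only an illustration with no Hadoop classes: SpillOrderSketch, kv and spill are hypothetical names, the combiner is assumed to be a simple sum, and the partition rule is assumed to be the key hash modulo the number of reducers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class SpillOrderSketch {
    // Hypothetical helper to build a (key, value) record.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Hypothetical stand-in for one spill of the in-memory buffer.
    static List<List<Map.Entry<String, Integer>>> spill(
            List<Map.Entry<String, Integer>> buffer, int numReducers) {
        // Step 1: put every record into the partition of its target reducer.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> rec : buffer) {
            int p = (rec.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            partitions.get(p).add(rec);
        }

        List<List<Map.Entry<String, Integer>>> spilled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            // Step 2: sort by key within the partition.
            part.sort(Map.Entry.comparingByKey());
            // Step 3: the combiner (a sum, by assumption) runs on the sorted
            // output and merges adjacent records that share a key.
            List<Map.Entry<String, Integer>> combined = new ArrayList<>();
            for (Map.Entry<String, Integer> rec : part) {
                int last = combined.size() - 1;
                if (last >= 0 && combined.get(last).getKey().equals(rec.getKey())) {
                    int sum = combined.get(last).getValue() + rec.getValue();
                    combined.set(last, kv(rec.getKey(), sum));
                } else {
                    combined.add(kv(rec.getKey(), rec.getValue()));
                }
            }
            // Step 4: this combined, sorted partition is what would go to disk.
            spilled.add(combined);
        }
        return spilled;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> buffer =
                Arrays.asList(kv("b", 1), kv("a", 1), kv("b", 1), kv("c", 1));
        System.out.println(spill(buffer, 2)); // [[b=2], [a=1, c=1]]
    }
}
```

Note that the two "b" records are merged only because they already sit in the same partition, which is why partitioning has to come first.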

job.setCombinerClass(MyCombiner.class);
job.setPartitionerClass(MyPartitioner.class);

If no combiner is specified, no combiner is executed. If no custom partitioner is specified, the default partitioner, HashPartitioner, is used (see page 221 for that).
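
To make the default versus custom distinction concrete, here is a minimal plain-Java sketch. SimplePartitioner is a hypothetical interface that only mirrors the shape of Hadoop's getPartition(key, value, numPartitions) method; HashLike uses the same arithmetic as HashPartitioner, and FirstLetter is an invented custom rule for illustration only.

```java
class PartitionerSketch {
    // Hypothetical interface mirroring the shape of Partitioner.getPartition.
    interface SimplePartitioner<K, V> {
        int getPartition(K key, V value, int numPartitions);
    }

    // Default behaviour: same arithmetic as Hadoop's HashPartitioner.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result
    // is never negative even when hashCode() is.
    static final class HashLike<K, V> implements SimplePartitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Invented custom rule: keys with the same first letter (ignoring case)
    // always go to the same reducer.
    static final class FirstLetter implements SimplePartitioner<String, Integer> {
        @Override
        public int getPartition(String key, Integer value, int numPartitions) {
            if (key.isEmpty()) {
                return 0;
            }
            return Character.toLowerCase(key.charAt(0)) % numPartitions;
        }
    }

    public static void main(String[] args) {
        SimplePartitioner<String, Integer> custom = new FirstLetter();
        System.out.println(custom.getPartition("Apple", 1, 3));
        System.out.println(custom.getPartition("avocado", 1, 3));
    }
}
```

Whichever rule is used, it must return a value in the range 0 to numPartitions - 1, because that number selects the reducer.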

3/ Yes, it is possible. Remember that a combiner works the same way as a reducer, and a reducer can consume compressed data. If the job consumes compressed data, it is because the input file format is compressed. To handle that, specify the input format in the driver class:

Sequence File case: job.setInputFormatClass(SequenceFileInputFormat.class);
Avro File case: job.setInputFormatClass(AvroKeyInputFormat.class); 
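
As a rough analogy in plain Java (not Hadoop code): the combiner never handles compressed bytes itself, because the reading layer decodes records before they reach the map and combine steps. In the sketch below GZIP stands in for a compressed input file, readRecords stands in for the input format, and all class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedInputSketch {
    // Produces a compressed "input file" held in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Plays the role of the input format: decompresses and yields plain records.
    static List<String> readRecords(byte[] compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Map step (emit (word, 1)) followed by a summing combiner.
    // Both work on decoded records only.
    static Map<String, Integer> mapAndCombine(List<String> records) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String record : records) {
            for (String word : record.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    combined.merge(word, 1, Integer::sum);
                }
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        byte[] file = gzip("a b\nb c\n");
        System.out.println(mapAndCombine(readRecords(file))); // {a=1, b=2, c=1}
    }
}
```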


Answer 2:

The direct answer to your question is => COMBINER

Details: Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before those results are distributed further. Once the combiner has run, its output is passed on to the reducer for further work.

whereas

Partitioners come into the picture when we are working with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It takes the mapper result (or the combiner result, if a combiner is used) and sends it to the responsible reducer based on the key.

For a better understanding, refer to Figure 4.6, "Combiner step inserted into the MapReduce data flow", in the Yahoo Developer Tutorial on Hadoop.



Answer 3:

This is the complete MR job flow. Your questions 1) and 2) are answered here.

  1. The mapper reads the data and processes it. Its output goes to an intermediate output file.
  2. Once the mapper has finished all the key-value pairs, the intermediate output is partitioned into 'R' partitions using either the default partitioner, HashPartitioner, or a custom partitioner.
  3. Each partitioned file is sorted.
  4. Any optional combiner code is executed on the sorted 'R' partitions. The combiner step runs only if a combiner is specified.
  5. Reducers reach out to the mappers and pull their appropriate partitioned files.
  6. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  7. The reducers then work on their individual key-value pairs one by one.
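
Steps 5 to 7 (the reduce side) can be sketched in plain Java as follows. This is a simplified illustration with hypothetical names and no Hadoop classes: it assumes the reduce function is a sum, and it uses a plain sort where a real job would merge runs that the mappers have already sorted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReduceSideSketch {
    // Hypothetical helper to build a (key, value) record.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // fetched: for ONE reducer, the partition it pulled from each mapper.
    static Map<String, Integer> shuffleSortReduce(
            List<List<Map.Entry<String, Integer>>> fetched) {
        // Step 5: gather this reducer's partition from every mapper.
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> fromOneMapper : fetched) {
            all.addAll(fromOneMapper);
        }
        // Step 6: sort again so equal keys from different mappers are adjacent.
        all.sort(Map.Entry.comparingByKey());
        // Step 7: process the records key by key (summing, by assumption).
        Map<String, Integer> output = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> rec : all) {
            output.merge(rec.getKey(), rec.getValue(), Integer::sum);
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> fetched = Arrays.asList(
                Arrays.asList(kv("a", 1), kv("c", 2)),   // from mapper 1
                Arrays.asList(kv("a", 3), kv("b", 1)));  // from mapper 2
        System.out.println(shuffleSortReduce(fetched)); // {a=4, b=1, c=2}
    }
}
```

The second sort is needed because each mapper's output is sorted only on its own; the key "a" arrives from both mappers and has to be brought together before it can be reduced.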

Answer to 3): Yes, a combiner can process compressed data. The combiner function runs on the output of the map phase and acts as a filtering or aggregating step that reduces the number of intermediate key-value pairs passed to the reducer. In most cases the reducer class is also set as the combiner class. The difference lies in where their output goes: the output of the combiner is intermediate data that is passed to the reducer, whereas the output of the reducer is written to the output file on disk. The combiner for a job can be set like this:

job.setCombinerClass(CustomCombiner.class);
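
One caveat when reusing the reducer as the combiner: it is only safe when applying the function to partial groups and then to the partial results gives the same answer as applying it once to everything (a sum behaves this way, a mean does not). The plain-Java sketch below illustrates this with hypothetical names and made-up values; it is not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CombinerReuseSketch {
    static double sum(List<Double> values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    // No combiner: the reducer sees every value emitted for the key.
    static double withoutCombiner(Function<List<Double>, Double> reduce,
                                  List<List<Double>> perMapper) {
        List<Double> all = new ArrayList<>();
        for (List<Double> chunk : perMapper) {
            all.addAll(chunk);
        }
        return reduce.apply(all);
    }

    // Reducer reused as combiner: applied once per mapper,
    // then applied again to the partial results.
    static double withCombiner(Function<List<Double>, Double> reduce,
                               List<List<Double>> perMapper) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> chunk : perMapper) {
            partials.add(reduce.apply(chunk));
        }
        return reduce.apply(partials);
    }

    public static void main(String[] args) {
        // Values for a single key, as emitted by two different mappers.
        List<List<Double>> perMapper = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0), Arrays.asList(4.0, 5.0));
        System.out.println(withoutCombiner(CombinerReuseSketch::sum, perMapper));  // 15.0
        System.out.println(withCombiner(CombinerReuseSketch::sum, perMapper));     // 15.0
        System.out.println(withoutCombiner(CombinerReuseSketch::mean, perMapper)); // 3.0
        System.out.println(withCombiner(CombinerReuseSketch::mean, perMapper));    // 3.25
    }
}
```

For something like a mean, the usual workaround is to have the combiner emit partial (sum, count) pairs and let the reducer do the final division.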


Answer 4:

The partitioner runs before the combiner.

  a) The mapper processes the data.
  b) A partitioner (either default or custom) then partitions the data by key, as required.
  c) The data is then sorted by key; this is taken care of by background threads.
  d) If a combiner exists, it runs next, on the sorted output.
  e) Finally, the reducer sorts its input data one more time and then runs the reduce step.



Answer 5:

I would like to summarize the entire flow:

  1. The mapper reads the data and processes it. Its output goes to an intermediate output file.
  2. The mapper finishes processing all the key-value pairs.
  3. The mapper's output is first written to a memory buffer.
  4. When the buffer is about to overflow, its contents are spilled to a local directory, and the partitions are created in memory: the intermediate output is divided into 'R' partitions using either the default partitioner, HashPartitioner, or a custom partitioner, and "within each partition, the background thread performs an in-memory sort by key".
  5. The data being spilled is divided according to the partitioner, and within each partition the result is sorted.
  6. If there is a custom combiner, it is executed in memory, and the result is spilled to disk at the end.
  7. Reducers reach out to the mappers and pull their appropriate partitioned files.
  8. After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
  9. The reducers then work on their individual key-value pairs one by one.
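
The buffer-and-spill part of this flow (steps 3 to 6) can be sketched in plain Java. This is a simplified model with hypothetical names: it uses a record-count threshold in place of Hadoop's byte-based one, a single partition, a summing combiner, and a synchronous spill, whereas a real map task spills in a background thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SpillBufferSketch {
    private final int spillAtRecords;                 // hypothetical threshold
    private final List<String> buffer = new ArrayList<>();
    final List<Map<String, Integer>> spills = new ArrayList<>();

    SpillBufferSketch(int spillAtRecords) {
        this.spillAtRecords = spillAtRecords;
    }

    // Step 3: map output (here the pair (key, 1)) is first held in memory.
    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= spillAtRecords) {
            spill(); // Step 4: buffer is nearly full, so spill it.
        }
    }

    // Steps 5 and 6: sort the buffered records by key and combine them
    // (TreeMap keeps keys sorted, merge() sums counts per key),
    // then record the result as one spill.
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        Map<String, Integer> run = new TreeMap<>();
        for (String key : buffer) {
            run.merge(key, 1, Integer::sum);
        }
        spills.add(run);
        buffer.clear();
    }

    // End of the map task: flush whatever is still buffered.
    void close() {
        spill();
    }

    public static void main(String[] args) {
        SpillBufferSketch task = new SpillBufferSketch(4);
        for (String key : new String[] {"a", "b", "a", "c", "a", "b"}) {
            task.collect(key);
        }
        task.close();
        System.out.println(task.spills); // [{a=2, b=1, c=1}, {a=1, b=1}]
    }
}
```

The key "a" ends up in both spills, so the combiner has run more than once for the same key and the partial results still have to be merged later. That is why a combiner has to be written so that running it zero, one or several times does not change the final result.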

Please point out any gaps in my understanding.