I'm confused after reading the following passage in Hadoop: The Definitive Guide, 4th edition (page 204):
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.
Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
Here are my questions:
1) Which runs first, the combiner or the partitioner?
2) When both a custom combiner and a custom partitioner are configured, in what order do the steps execute?
3) Can we feed compressed data (Avro, SequenceFile, etc.) to a custom combiner? If yes, how?
I'm looking for a brief but in-depth explanation.
Thanks in advance.
1/ The answer is already given in this part of the passage: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."
So the partitions are created in memory first. Each partition is then sorted by key, and if a combiner is configured, it runs in memory on the sorted output. The result is spilled to disk at the end.
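The order described above (partition, then sort, then combine, then spill) can be sketched in plain Java with no Hadoop dependency. This is only an illustration under stated assumptions: the class `MapSideSpillSketch`, its methods, and the summing combiner are hypothetical stand-ins, not Hadoop's actual spill code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MapSideSpillSketch {

    /** Hypothetical stand-in for a partitioner: picks a reducer index for a key. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Simulates one spill of the in-memory buffer: partition, sort, combine. */
    static List<List<Map.Entry<String, Integer>>> spill(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        // Step 1: divide the buffered records into one partition per reducer.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> rec : mapOutput) {
            partitions.get(partitionFor(rec.getKey(), numReducers)).add(rec);
        }

        List<List<Map.Entry<String, Integer>>> spilled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            // Step 2: sort the records of this partition by key.
            partition.sort(Map.Entry.comparingByKey());

            // Step 3: run the combiner (here: a sum) over the sorted records,
            // collapsing adjacent records that share a key.
            List<Map.Entry<String, Integer>> combined = new ArrayList<>();
            for (Map.Entry<String, Integer> rec : partition) {
                int last = combined.size() - 1;
                if (last >= 0 && combined.get(last).getKey().equals(rec.getKey())) {
                    Map.Entry<String, Integer> prev = combined.get(last);
                    combined.set(last,
                            new SimpleEntry<>(prev.getKey(), prev.getValue() + rec.getValue()));
                } else {
                    combined.add(new SimpleEntry<>(rec.getKey(), rec.getValue()));
                }
            }
            // Step 4: in Hadoop this is where the data would be written to local disk.
            spilled.add(combined);
        }
        return spilled;
    }
}
```

With a single reducer, the map output `(b,1), (a,1), (b,1)` ends up as the sorted and combined run `(a,1), (b,2)`, so two records are written instead of three.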
2/ A custom combiner and a custom partitioner are used only when they are specified in the driver class.
If no combiner is specified, no combiner is executed. If no custom partitioner is specified, the default partitioner, "HashPartitioner", is used (see page 221 for that).
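To make the default behaviour concrete, here is a small plain-Java sketch of the hashing formula that HashPartitioner applies. The class name `HashPartitionSketch` is hypothetical and the method is a standalone copy of the idea, not the Hadoop class itself.

```java
final class HashPartitionSketch {

    /**
     * Maps a key to a reducer index in the range [0, numReduceTasks).
     * Masking with Integer.MAX_VALUE clears the sign bit, so keys with a
     * negative hashCode() still land in a valid partition.
     */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the result depends only on the key and the number of reducers, every record with the same key is routed to the same reducer, no matter which mapper produced it.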
3/ Yes, it is possible. Remember that a combiner works the same way as a reducer, and a reducer can consume compressed data. If the consumer consumes compressed data, that means the input file format is compressed. For that, you can specify the following instruction in the driver class:
This is the complete MapReduce job flow. Your questions 1) and 2) are answered here.
Answer 3: Yes, a combiner can process compressed data. The combiner function runs on the output of the map phase and is used as a filtering or aggregating step to reduce the number of intermediate key/value pairs that are passed to the reducer. In most cases the reducer class is also set as the combiner class. The difference lies in where the output of each goes: the output of the combiner is intermediate data that is passed to the reducer, whereas the output of the reducer is written to the output file on disk. The combiner for a job can be set like this:
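A minimal driver sketch, assuming Hadoop's bundled `TokenCounterMapper` and `IntSumReducer` classes as stand-ins for your own mapper and reducer. The class name `CombinerDriver`, the job name, and the use of `args[0]` and `args[1]` as input and output paths are hypothetical choices for illustration, and the Hadoop client libraries must be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "token count with combiner");
        job.setJarByClass(CombinerDriver.class);

        // Mapper emits (token, 1) for every token in the input.
        job.setMapperClass(TokenCounterMapper.class);

        // The same summing class is used as combiner and as reducer.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        // Optional: HashPartitioner is already the default; a custom
        // partitioner class would be registered with the same call.
        job.setPartitionerClass(HashPartitioner.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // assumed input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // assumed output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as the combiner works here because summing gives the same result whether it is applied once or in several partial steps.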
The direct answer to your question is => COMBINER
Details: Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before those results are distributed further. Once the combiner has run, its output is passed on to the reducer for further work,
whereas
a partitioner comes into the picture when we are working with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It basically takes the mapper result (or the combiner result, if a combiner is used) and sends it to the responsible reducer based on the key.
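The "mini-reducer" view of the combiner can be illustrated with a short word-count sketch in plain Java. The class `LocalReduceSketch` and its two methods are hypothetical names used only for this example; they imitate the idea and do not use the Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalReduceSketch {

    /** Combiner stand-in: counts words within ONE mapper's output. */
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mapperWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    /** Reducer stand-in: sums the partial counts coming from ALL mappers. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }
}
```

If one mapper emits the words `a, b, a`, its combiner sends only two records (`a=2`, `b=1`) instead of three, and the reducer still arrives at the same totals once the partial counts from all mappers are summed.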
For a better understanding, you can refer to the following image, which I have taken from the Yahoo Developer Tutorial on Hadoop. Figure 4.6: Combiner step inserted into the MapReduce data flow
Here is the tutorial.
The partitioner runs before the combiner.
a) The mapper processes the input data.
b) A partitioner (either the default or a custom one) then partitions the map output by key.
c) Each partition is then sorted by key; this is taken care of by a background thread.
d) If a combiner exists, it runs next, on the sorted output.
e) Finally, the reducer side sorts its input one more time, and then the reduce function runs.
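The reduce-side step described above, where the already sorted outputs of several mappers are brought together before the reduce function runs, can be sketched as follows. This is a simplified plain-Java illustration: `ReduceSideSketch` is a hypothetical class, and a `TreeMap` stands in for Hadoop's merge sort rather than reproducing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideSketch {

    /**
     * Brings together the runs fetched from each mapper for one partition
     * and groups the values by key, in key order.
     */
    static TreeMap<String, List<Integer>> mergeAndGroup(
            List<List<Map.Entry<String, Integer>>> runs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> run : runs) {
            for (Map.Entry<String, Integer> rec : run) {
                grouped.computeIfAbsent(rec.getKey(), k -> new ArrayList<>())
                       .add(rec.getValue());
            }
        }
        return grouped;
    }

    /** Reduce function stand-in: sums the grouped values of each key. */
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }
}
```

Given the two runs `(a,2), (b,1)` and `(a,1), (c,1)`, the reduce function sees `a` with the values `[2, 1]` first and produces `a=3`, followed by `b=1` and `c=1`.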
I would like to summarize the entire flow:
Please let me know if there are any gaps in my understanding.