I am trying to understand the various phases an MR job goes through. I read the online documentation for this.
Based on that, my understanding of the sequence is as below:
map() -> Partitioner -> Sorting (at mapper machine) -> Shuffle -> Sorting (at reducer machine) -> groupBy(Key) (at reducer machine) -> reduce()
Is this the correct sequence in which an MR job executes?
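To make my understanding concrete, here is a toy pure-Python simulation of that exact sequence for word count. This is not Hadoop code, just the data flow; the function names, the two-reducer setup, and the use of Python's hash() in place of Java's hashCode() are all my own illustrative choices.

```python
from itertools import groupby
import heapq

NUM_REDUCERS = 2

def map_fn(line):
    # map(): emit one (word, 1) pair per token
    for word in line.split():
        yield (word, 1)

def partition(key):
    # Stand-in for a hash partitioner: same key -> same partition
    return (hash(key) & 0x7FFFFFFF) % NUM_REDUCERS

splits = ["a b a", "b c"]

# map() -> Partitioner -> sorting at the mapper machine
mapper_outputs = []  # one {partition: sorted run} dict per map task
for split in splits:
    buckets = {r: [] for r in range(NUM_REDUCERS)}
    for kv in map_fn(split):
        buckets[partition(kv[0])].append(kv)
    for r in buckets:
        buckets[r].sort()  # each bucket is a run sorted by key
    mapper_outputs.append(buckets)

# Shuffle -> merge sort at the reducer -> groupBy(key) -> reduce()
result = {}
for r in range(NUM_REDUCERS):
    runs = [m[r] for m in mapper_outputs]   # shuffle: fetch my bucket from every mapper
    merged = list(heapq.merge(*runs))       # merge sorted runs into a single run
    for key, grp in groupby(merged, key=lambda kv: kv[0]):
        result[key] = sum(v for _, v in grp)  # reduce(): sum counts per key

# result == {"a": 2, "b": 2, "c": 1}
```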
Various phases of a MapReduce job:
Map phase:
- Reads its assigned input split from HDFS
- Parses the input into records as key-value pairs
- Applies the map function to each record
- Informs the master node of its completion
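The record-to-pairs step above can be sketched as a toy word-count mapper. This is plain Python, not the Hadoop Mapper API; the (offset, line) record shape mirrors the usual text-input convention but the function name is mine.

```python
# Toy word-count mapper: one input record (byte offset, text line)
# becomes zero or more (word, 1) key-value pairs
def map_fn(offset, line):
    for word in line.split():
        yield (word, 1)

pairs = list(map_fn(0, "the quick the"))
# pairs == [("the", 1), ("quick", 1), ("the", 1)]
```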
Partition phase:
- Each mapper determines which reducer will receive each of its outputs
- For any given key, the destination partition is the same, no matter which mapper produced it
- No. of partitions = No. of reducers
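The usual way to guarantee "same key, same partition" is hashing modulo the reducer count (what Hadoop's default HashPartitioner does). Below is a Python analogue; Python's hash() stands in for Java's hashCode(), so the exact bucket assignments differ from Hadoop's, but the invariants are the same.

```python
def partition(key, num_reducers):
    # Mask to a non-negative value, then take modulo the reducer count,
    # mirroring (hashCode & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Deterministic within a run: the same key always maps to the same partition,
# and the result is always a valid reducer index in [0, num_reducers)
p = partition("apple", 4)
```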
Shuffle phase:
- Fetches input data from all map tasks for the portion corresponding to the reduce task's bucket
Sort phase:
- Merge sorts all map outputs into a single run
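Because every map task ships a run already sorted by key, the reduce side only needs a k-way merge to get one sorted run. A minimal sketch using Python's heapq.merge (the runs and their contents here are made up for illustration):

```python
import heapq

# Two sorted runs, as fetched from two different map tasks during shuffle
run_from_mapper1 = [("apple", 1), ("cat", 1)]
run_from_mapper2 = [("banana", 1), ("cat", 1)]

# k-way merge of sorted runs into a single sorted run;
# equal keys end up adjacent, ready for grouping
merged = list(heapq.merge(run_from_mapper1, run_from_mapper2))
# merged == [("apple", 1), ("banana", 1), ("cat", 1), ("cat", 1)]
```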
Reduce phase:
- Applies the user-defined reduce function to the merged run
- Arguments are the key and its corresponding list of values
- Writes the output to a file in HDFS
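The grouping step works precisely because the merged run is sorted: equal keys are adjacent, so a single pass yields each key once with all of its values. A toy word-count reducer over such a run (names and data are illustrative, not the Hadoop Reducer API):

```python
from itertools import groupby

def reduce_fn(key, values):
    # Word-count style reducer: sum all counts for one key
    return (key, sum(values))

# Input is the sorted, merged run, so groupby sees each key exactly once
merged = [("apple", 1), ("cat", 1), ("cat", 1)]
output = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(merged, key=lambda kv: kv[0])]
# output == [("apple", 1), ("cat", 2)]
```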
Timeline of a MapReduce job:
- Map Phase: several Map Tasks are executed
- Reduce Phase: several Reduce Tasks are executed
Timeline for MapTask
Timeline for ReduceTask
Image source: https://www.slideshare.net/EmilioCoppa/hadoop-internals