I am trying to understand the various phases that a MapReduce (MR) job goes through. I have read the online documentation on this.
Based on it, my understanding of the sequence is as follows:
map() -> Partitioner -> Sorting (at mapper machine) -> Shuffle -> Sorting (at reducer machine) -> groupBy(Key) (at reducer machine) -> reduce()
Is this the correct sequence in which a MR Job executes?
Various phases of a map reduce job:
Map phase
Reads assigned input split from HDFS
Parses input into records as key-value pairs
Applies map function to each record
Informs master node of its completion
Partition phase
Each mapper must determine which reducer will receive each of its outputs
For any given key, the destination partition is always the same, regardless of which mapper emitted it
Number of partitions = number of reducers
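To make the partition rule concrete, here is a minimal sketch in Python. It is not Hadoop code: `java_string_hash` and `partition_for` are hypothetical helper names, and the Java-style string hash is only a stand-in for whatever hash the real key type provides. The idea it illustrates is the one behind hash partitioning: take a non-negative hash of the key modulo the number of reducers, so the same key always lands in the same partition.

```python
def java_string_hash(s: str) -> int:
    """Java-style string hash (h = 31*h + char), kept as an unsigned 32-bit value.

    Assumption: used here only as a deterministic stand-in hash.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition_for(key: str, num_reducers: int) -> int:
    """Return the partition (reducer index) for a key.

    Clearing the top bit keeps the hash non-negative before taking the remainder.
    """
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


# Every occurrence of a key maps to the same partition, whichever mapper emits it.
keys = ["apple", "banana", "apple", "cherry"]
partitions = [partition_for(k, 3) for k in keys]
```

Because the function depends only on the key and the reducer count, two different mappers that emit the same key will independently pick the same reducer.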
Shuffle phase
Sort phase
Reduce phase
Applies the user-defined reduce function to the merged run
Arguments are the key and the corresponding list of values
Writes output to a file in HDFS
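The phases listed above can be tied together in a small in-memory simulation. This is only a sketch under stated assumptions: it is plain Python rather than Hadoop, all function names (`map_fn`, `reduce_fn`, `run_map_task`, `run_reduce_task`, `run_job`) are hypothetical, `zlib.crc32` stands in for the key hash, and word count is used as the example job. It follows the order map -> partition -> sort (map side) -> shuffle -> merge (reduce side) -> group by key -> reduce.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def map_fn(line):
    """User-defined map: emit (word, 1) for every word in the record."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reduce: sum the list of values for one key."""
    return key, sum(values)


def partition(key, num_reducers):
    # Assumption: crc32 is a deterministic stand-in for the key's hash.
    return crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split, num_reducers):
    """Map, partition, and sort on the mapper side.

    Returns a dict: partition id -> list of (key, value) pairs sorted by key.
    """
    buckets = defaultdict(list)
    for line in split:
        for key, value in map_fn(line):
            buckets[partition(key, num_reducers)].append((key, value))
    return {p: sorted(pairs, key=itemgetter(0)) for p, pairs in buckets.items()}


def run_reduce_task(partition_id, map_outputs):
    """Shuffle, merge, group by key, and reduce for one partition."""
    # Shuffle: fetch this reducer's partition from every map output.
    runs = [out.get(partition_id, []) for out in map_outputs]
    # Reduce-side sort: merge the already sorted runs into one sorted stream.
    merged = merge(*runs, key=itemgetter(0))
    # Group by key, then call reduce once per key with its list of values.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(merged, key=itemgetter(0))]


def run_job(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, and collect the results."""
    map_outputs = [run_map_task(split, num_reducers) for split in splits]
    results = {}
    for p in range(num_reducers):
        results.update(run_reduce_task(p, map_outputs))
    return results
```

For example, `run_job([["a b a", "c"], ["b a"]])`, where each inner list plays the role of one input split, gives the counts a: 3, b: 2, c: 1. A real job differs in many ways (spills to disk, combiners, parallel tasks, output to HDFS), so treat this purely as a way to see the ordering of the phases.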
Timeline of Map Reduce Job
Timeline for MapTask
Timeline for ReduceTask
Image source: https://www.slideshare.net/EmilioCoppa/hadoop-internals