Correct order of various phases of MR job?

Posted 2019-07-27 09:11

Question:

I am trying to understand the various phases that an MR job goes through. I have read the online documentation on this.

Based on this, my understanding of the sequence is as follows:

map() -> Partitioner -> Sorting (at mapper machine) -> Shuffle -> Sorting (at reducer machine) -> groupBy(Key) (at reducer machine) -> reduce()

Is this the correct sequence in which a MR Job executes?

Answer 1:

Various phases of a map reduce job:

Map phase:

  • Reads assigned input split from HDFS

  • Parses input into records as key-value pairs

  • Applies map function to each record

  • Informs master node of its completion
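
As an illustration of the map phase bullets above, here is a minimal single-process sketch in plain Python (not Hadoop code). The helper names `parse_records`, `word_count_map` and `run_map` are hypothetical, and word count is only an assumed example of a map function. Records are keyed by character offset within the split, loosely mirroring how line-oriented text input is usually presented to a mapper.

```python
def parse_records(split_text):
    """Parse a text split into (offset, line) key-value records."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        # Key: offset of the line within the split; value: the line without its newline.
        yield offset, line.rstrip("\r\n")
        offset += len(line)


def word_count_map(offset, line):
    """Example map function (assumption): emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1


def run_map(split_text, map_fn):
    """Apply map_fn to every record of the split and collect the emitted pairs."""
    output = []
    for offset, line in parse_records(split_text):
        output.extend(map_fn(offset, line))
    return output
```

For example, `run_map("a b\na\n", word_count_map)` emits one `(word, 1)` pair per word, in input order.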

Partition phase

  • Each mapper must determine which reducer will receive each of its outputs

  • For any given key, the destination partition is always the same, regardless of which mapper emitted it

  • No. of partitions = No. of reducers
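
The partitioning rule above can be sketched as a hash of the key taken modulo the number of reducers. This is a plain Python illustration, not Hadoop's implementation: `partition_for` and `partition_map_output` are hypothetical names, and `zlib.crc32` is an assumed stand-in for whatever hash function the framework applies to the key.

```python
import zlib


def partition_for(key, num_reducers):
    """Return the index of the reducer that should receive this key."""
    # crc32 is used because it is deterministic across processes;
    # Python's built-in hash() is salted per process for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Split one mapper's (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets
```

Because the partition depends only on the key and the reducer count, every mapper sends a given key to the same reducer, which is what lets a single reduce call see all values for that key.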

Shuffle phase

  • Fetches input data from all map tasks for the portion corresponding to the reduce task's bucket
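
A rough sketch of what the shuffle collects for one reduce task, assuming each map task's output is already split into one bucket per reducer. The function name `fetch_for_reducer` and the nested-list layout are assumptions for illustration; in a real cluster the buckets are fetched over the network from the mapper machines.

```python
def fetch_for_reducer(map_outputs, reducer_index):
    """Collect, from every map task, the bucket that belongs to one reducer.

    map_outputs: list with one entry per map task; each entry is a list of
    buckets, where bucket i holds the (key, value) pairs for reducer i.
    """
    return [buckets[reducer_index] for buckets in map_outputs]
```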

Sort phase

  • Merge sorts all map outputs into a single run
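
The merge step can be sketched with the standard library's `heapq.merge`, assuming each fetched map output is already sorted by key (the mapper-side sort mentioned in the question). `merge_sorted_runs` is a hypothetical helper name, and the sketch keeps everything in memory, which a real job would not necessarily do.

```python
import heapq
from operator import itemgetter


def merge_sorted_runs(runs):
    """Merge several key-sorted lists of (key, value) pairs into one sorted run."""
    # heapq.merge performs a k-way merge without re-sorting the inputs.
    return list(heapq.merge(*runs, key=itemgetter(0)))
```

After merging, all pairs with the same key sit next to each other, so grouping by key becomes a single linear pass.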

Reduce phase

  • Applies the user-defined reduce function to the merged run

  • Arguments are the key and the corresponding list of values

  • Writes output to a file in HDFS
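
To make the reduce phase concrete, here is a minimal sketch that groups a key-sorted merged run by key and calls a reduce function once per key. `run_reduce` and `sum_reducer` are hypothetical names, summing is only an assumed example of a reduce function, and the result is returned as a list rather than written to HDFS.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(merged_run, reduce_fn):
    """Call reduce_fn(key, values) once per distinct key of a key-sorted run."""
    output = []
    # groupby only groups adjacent items, so the run must already be sorted by key.
    for key, group in groupby(merged_run, key=itemgetter(0)):
        values = (value for _, value in group)
        output.extend(reduce_fn(key, values))
    return output


def sum_reducer(key, values):
    """Example reduce function (assumption): emit the sum of all values for the key."""
    yield key, sum(values)
```

For example, `run_reduce([("a", 1), ("a", 1), ("b", 1)], sum_reducer)` returns `[("a", 2), ("b", 1)]`.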



Answer 2:

Timeline of Map Reduce Job

  • Map Phase: several Map Tasks are executed
  • Reduce Phase: several Reduce Tasks are executed

[Figure: Timeline for MapTask]

[Figure: Timeline for ReduceTask]

Image source : https://www.slideshare.net/EmilioCoppa/hadoop-internals