Is the output of the map phase of a MapReduce job always sorted?

Posted 2019-04-08 20:29

I am a bit confused with the output I get from Mapper.

For example, when I run a simple wordcount program, with this input text:

hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount

this is the output that I get:

12345678    1
Hadoop  1
hello   1
hello   1
if  1
lets    1
mapreduce   1
mapreduce   1
programming 1
see 1
this    1
wordcount   1
wordcount   1
works   1
world   1
world   1

As you can see, the output from the mapper is already sorted. I did not run a Reducer at all. But in a different project I found that the output from the mapper was not sorted, so I am not totally clear about this.

My questions are:

  1. Is the mapper's output always sorted?
  2. Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
  3. Is there a way to collect the data from sort and shuffle phase and persist it before it goes to Reducer? A reducer is presented with a key and a list of iterables. Is there a way, I could persist this data?

5 Answers
贪生不怕死
#2 · 2019-04-08 20:57

1. Is the mapper's output always sorted?

2. Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?

From the Apache MapReduce Tutorial:

( Under Mapper Section )

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
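For context, the partitioning referred to above is hash-based by default; Hadoop's stock HashPartitioner does essentially the following (a minimal sketch of the same logic, not the library source):

import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based partitioning: each sorted map output record is assigned to one of
// numReduceTasks partitions, so the number of partitions equals the reducer count.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit before taking the modulus, as HashPartitioner does.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}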

( Under Reducer Section )

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

3. Is there a way to collect the data from sort and shuffle phase and persist it before it goes to Reducer? A reducer is presented with a key and a list of iterables. Is there a way, I could persist this data?

I don't think so. From the Apache documentation on Reducer:

Reducer has 3 primary phases:

Shuffle:

The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

Reduce:

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

As per the documentation, the shuffle and sort phases are driven by the framework.

If you want to persist the data, set the number of reducers to zero, which causes the map output to be persisted to HDFS, but it won't sort the data.
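For example, a map-only driver along these lines (a minimal sketch; WordCountMapper and the paths are placeholders) writes the unsorted map output straight to HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only word count");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(WordCountMapper.class);   // hypothetical mapper class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);                    // zero reducers: map output is
                                                     // written directly to HDFS, unsorted
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}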

Have a look at a related SE question:

hadoop: difference between 0 reducer and identity reducer?

I did not find IdentityReducer in the Hadoop 2.x API:

identityreducer in the new Hadoop API


倾城 Initia
#3 · 2019-04-08 21:00

Is the mapper's output always sorted?

No. It is not sorted if you use no reducer. If you use a reducer, there is a pre-sorting process before the mapper's output is written to disk, and the data then gets fully sorted in the Reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, is translated into using the Identity Reducer (see this answer and comment). The Identity Reducer just outputs its input. To verify that, check the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records, etc.).
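For instance, a quick way to inspect those counters once the job has completed (a sketch, assuming the new org.apache.hadoop.mapreduce API):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ReduceCounterCheck {
    // Print the reduce-side counters of a completed Job; non-zero values indicate
    // that a (possibly identity) reduce phase actually ran.
    static void printReduceCounters(Job job) throws Exception {
        Counters counters = job.getCounters();
        System.out.println("reduce input groups:   "
                + counters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue());
        System.out.println("reduce input records:  "
                + counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue());
        System.out.println("reduce output records: "
                + counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue());
    }
}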

Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?

As I explained for the previous question, if you use no reducers, the mapper does not sort the data. If you do use reducers, the data starts getting sorted in the map phase and is then merge-sorted in the reduce phase.

Is there a way to collect the data from sort and shuffle phase and persist it before it goes to Reducer. A reducer is presented with a key and a list of iterables. Is there a way, I could persist this data?

Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per reducer, with the value being a concatenation of the iterable's values, just accumulate the values in memory (e.g. in a StringBuffer) and then output this concatenation as the value (see the reducer sketch below). If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class, like this:

job.setNumReduceTasks(0);

This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
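As for the concatenation idea above, a reducer along these lines could work (a minimal sketch; the class name and Text types are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits one record per key, with all of that key's values concatenated
// into a single comma-separated string.
public class ValueConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        for (Text value : values) {
            if (joined.length() > 0) {
                joined.append(",");
            }
            joined.append(value.toString());
        }
        context.write(key, new Text(joined.toString()));
    }
}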

小情绪 Triste *
#4 · 2019-04-08 21:11

Following are some explanations for your questions:

  • Is the mapper's output always sorted?

    Already answered by @SurJanSR.

  • Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?

    In a MapReduce job, as you know, the Mapper runs on individual splits of data, across the nodes where the data is persisted. The result of the Mapper is written temporarily before it is handed to the next phase.

  • In the case of a reduce operation, the temporarily stored Mapper output is sorted and shuffled based on the partitioner before it is moved to the reduce operation.

  • In the case of a map-only job, as in your case, the temporarily stored Mapper output is sorted based on the key and written to the final output folder (as specified in your arguments for the job).

  • Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?

    Not sure what your requirement is. Using an IdentityReducer would just persist the output (a sketch of a pass-through reducer follows below). I'm not sure if this answers your question.
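For reference, a pass-through (identity-style) reducer in the new API would look roughly like this (a minimal sketch; in Hadoop 2.x the base Reducer class already behaves this way, and the Text/IntWritable types are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Writes every (key, value) pair through unchanged, so the sorted and grouped
// reducer input is simply persisted to the job's output path.
public class PassThroughReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            context.write(key, value);
        }
    }
}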

倾城 Initia
#5 · 2019-04-08 21:12

Point 1: the output from the mapper is always sorted, but based on the key. That is, if the map method does context.write(outKey, outValue); then the result will be sorted by outKey.
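For example, a word-count style map method along these lines (a minimal sketch; the class name and types are assumptions) produces output sorted by the emitted word, not by the count:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in the input line. The framework sorts the
// emitted pairs by the key (outKey = word), not by the value.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // outKey = word, outValue = 1
        }
    }
}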

做个烂人
#6 · 2019-04-08 21:13

I support the answer of vefthym. Usually the Mapper output is sorted before being stored locally on the node. But when you explicitly set numReduceTasks to 0 in the job configuration, the mapper output will not be sorted and is written directly to HDFS. So we cannot say that the Mapper output is always sorted!
