I am a bit confused by the output I get from the Mapper.
For example, when I run a simple wordcount program, with this input text:
hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount
this is the output that I get:
12345678 1
Hadoop 1
hello 1
hello 1
if 1
lets 1
mapreduce 1
mapreduce 1
programming 1
see 1
this 1
wordcount 1
wordcount 1
works 1
world 1
world 1
As you can see, the output from the mapper is already sorted. I did not run a Reducer at all.
But in a different project I find that the output from the mapper is not sorted.
So I am not totally clear about this.
My questions are:
- Is the mapper's output always sorted?
- Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
- Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
1. Is the mapper's output always sorted?
2. Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
See the Apache MapReduce Tutorial (under the Mapper section and under the Reducer section).
3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
I don't think so. From the Apache documentation on Reducer:
Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce: The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
As per the documentation, the shuffle and sort phase is driven by the framework.
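To make the quoted "merge sorts Reducer inputs by keys" step more concrete, here is a minimal plain-Java sketch of a k-way merge over already-sorted key lists. It does not use Hadoop at all; the class name `MergeSortedMapOutputs`, the `Cursor` helper, and the sample keys are made up for illustration, and the real framework merges serialized spill files rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSortedMapOutputs {

    // Hypothetical helper: tracks the read position inside one mapper's sorted output.
    private static final class Cursor {
        final List<String> source;
        int index;

        Cursor(List<String> source) {
            this.source = source;
        }

        String current() {
            return source.get(index);
        }
    }

    /** Merges several already-sorted key lists into one sorted list. */
    public static List<String> merge(List<List<String>> sortedOutputs) {
        // The heap always exposes the cursor whose current key is smallest.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> output : sortedOutputs) {
            if (!output.isEmpty()) {
                heap.add(new Cursor(output));
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.current());
            smallest.index++;
            if (smallest.index < smallest.source.size()) {
                heap.add(smallest); // re-insert so its next key is considered
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two pretend mapper outputs, each already sorted by key.
        List<String> mapper1 = Arrays.asList("Hadoop", "hello", "programming", "world");
        List<String> mapper2 = Arrays.asList("hello", "mapreduce", "wordcount", "world");
        System.out.println(merge(Arrays.asList(mapper1, mapper2)));
    }
}
```

Equal keys from different mappers end up next to each other after the merge, which is what lets the framework hand each key to the reducer together with all of its values.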
If you want to persist the data, set the number of reducers to zero, which causes the map output to be written to HDFS, but the data will not be sorted.
Have a look at related SE question:
hadoop: difference between 0 reducer and identity reducer?
I did not find IdentityReducer in Hadoop 2.x version:
identityreducer in the new Hadoop API
No. It is not sorted if you use no reducer. If you use a reducer, there is a pre-sorting process before the mapper's output is written to disk. Data gets sorted in the Reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, is translated into using the Identity Reducer (see this answer and comment). The Identity Reducer just outputs its input. To verify that, see the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records...)
As I explained in the previous question, if you use no reducers, mapper does not sort the data. If you do use reducers, the data start getting sorted from the map phase and then get merge-sorted in the reduce phase.
Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per key, with the value being a concatenation of the values in the iterable, just append the values to an in-memory buffer (e.g. a StringBuffer) and then output this concatenation as the value. If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class by calling setNumReduceTasks(0) on the Job.
This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
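The concatenation idea from this answer could look roughly like the following. This is only a plain-Java sketch of what the body of a reduce call might do, not an actual Hadoop Reducer: the class name `ConcatValues`, the comma separator, and the sample values are assumptions, and it uses StringBuilder in place of the StringBuffer mentioned above since no synchronization is needed here.

```java
import java.util.Arrays;

public class ConcatValues {

    /** Joins all values seen for one key into a single string. */
    public static String concat(Iterable<String> values) {
        StringBuilder joined = new StringBuilder();
        boolean first = true;
        for (String value : values) {
            if (!first) {
                joined.append(','); // separator choice is an assumption
            }
            joined.append(value);
            first = false;
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // In a real reducer the key and values would come from the framework.
        String key = "hello";
        String value = concat(Arrays.asList("1", "1"));
        System.out.println(key + "\t" + value);
    }
}
```

Keep in mind that buffering every value for a key in memory only works when no single key has a very large number of values.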
Here are some explanations for your questions.
Is the output from the mapper always sorted?
Already answered by @SurJanSR
Is the sort phase integrated with the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
In a MapReduce job, as you know, the Mapper runs on individual splits of data, on the nodes where the data resides. The result of the Mapper is written temporarily before it is passed to the next phase.
In the case of a job with a reduce step, the temporarily stored Mapper output is sorted and shuffled according to the partitioner before it is moved to the reduce operation.
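For the partitioning step mentioned above, Hadoop's default HashPartitioner picks the reducer as (hash of the key, with the sign bit cleared) modulo the number of reduce tasks. Below is a small plain-Java sketch of that formula. It uses `String.hashCode()` where Hadoop would use the hash of the actual key type (for example Text), so the concrete partition numbers are not the ones a cluster would produce; the class name `HashPartitionSketch` is made up.

```java
public class HashPartitionSketch {

    /** Returns the reducer index in [0, numReduceTasks) for the given key. */
    public static int partitionFor(String key, int numReduceTasks) {
        // Clearing the sign bit keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"hello", "world", "Hadoop", "hello"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 3));
        }
    }
}
```

The point is that the same key always maps to the same reducer, so all of its values meet in a single reduce call.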
In the case of a map-only job, as in your case, the temporarily stored Mapper output is sorted by key and written to the final output folder (as specified in your arguments for the job).
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
Not sure what your requirement is. Using an IdentityReducer would just persist the output. I'm not sure if this answers your question.
Point 1: The output from the mapper is always sorted, but based on the key, i.e. if the map method is doing this:
context.write(outKey, outValue);
then the result will be sorted based on outKey.
I support the answer of vefthym. Usually the Mapper output is sorted before it is stored locally on the node. But when you explicitly set numReduceTasks to 0 in the job configuration, the mapper output is not sorted and is written directly to HDFS. So we cannot say that Mapper output is always sorted!
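To see the difference between the two cases without a cluster, here is a plain-Java imitation using the input from the question. It is only a sketch: `MapOutputOrder`, `mapOnly` and `mapThenSort` are made-up names, and `String.compareTo` stands in for Hadoop's key comparison (for ASCII words like these, the two should order keys the same way).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MapOutputOrder {

    /** Emits one key per word in reading order, like a job with zero reduce tasks. */
    public static List<String> mapOnly(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    keys.add(word);
                }
            }
        }
        return keys;
    }

    /** Same keys, but sorted, like the output after the framework's sort by key. */
    public static List<String> mapThenSort(List<String> lines) {
        List<String> keys = mapOnly(lines);
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "hello world",
                "Hadoop programming",
                "mapreduce wordcount",
                "lets see if this works",
                "12345678",
                "hello world",
                "mapreduce wordcount");
        System.out.println("unsorted: " + mapOnly(input));
        System.out.println("sorted:   " + mapThenSort(input));
    }
}
```

The sorted list comes out in the same key order as the output shown in the question (12345678, Hadoop, hello, hello, if, ...), while the unsorted list simply follows the input text.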