I am using the Hadoop streaming JAR for WordCount and want to know how to get a globally sorted result. According to an answer to another question on SO, using just one reducer should produce globally sorted output, but my result with numReduceTasks=1 (one reducer) is not sorted.
For example, my input to the mapper is:
file 1: A long time ago in a galaxy far far away
file 2: Another episode for Star Wars
The result is:
A 1
a 1
Star 1
ago 1
for 1
far 2
away 1
time 1
Wars 1
long 1
Another 1
in 1
episode 1
galaxy 1
But this is not globally sorted!
So what does "sort" mean in "shuffle and sort", and what does "globally sorted" mean?
mapper code:
#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" line for every word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
reducer code:
#!/usr/bin/env python
import sys

# Accumulate the count for each word in a dictionary
word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

# Emit the totals in the dictionary's iteration order
for word in word2count.keys():
    print('%s\t%s' % (word, word2count[word]))
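One thing I am not sure about is whether the dictionary in my reducer is what loses the order, since it is printed in the dictionary's own iteration order rather than the order in which lines arrive. As a minimal sketch, assuming the streaming framework hands the reducer its lines already sorted by key (an assumption I have not verified here), a variant that sums each run of identical keys as it arrives would keep whatever order the input has. The function name reduce_sorted is only for illustration:

```python
#!/usr/bin/env python
from __future__ import print_function
import sys
from itertools import groupby


def reduce_sorted(lines):
    """Yield 'word<TAB>count' lines, preserving the order in which keys arrive.

    Assumes identical keys are adjacent in the input (i.e. the input is
    already sorted or grouped by key).
    """
    # Split each non-empty line into [word, count]
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines if line.strip())
    # groupby only merges consecutive items with the same key
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = 0
        for kv in group:
            try:
                total += int(kv[1])
            except (ValueError, IndexError):
                # Skip lines with a missing or non-numeric count
                continue
        yield '%s\t%d' % (word, total)


if __name__ == '__main__':
    for out in reduce_sorted(sys.stdin):
        print(out)
```

Unlike my reducer above, this never stores the words in a dictionary, so the output order is exactly the order of the incoming keys. It can be tried locally by piping the mapper output through a sort step before this script.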
I use this command to run it:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new_0 \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py \
-numReduceTasks 1