Hadoop - What does global sort mean, and when does it happen?

Posted 2019-05-06 17:59

I am using the Hadoop streaming JAR for WordCount, and I want to know how to get globally sorted output. According to an answer to another question on SO, using just one reducer should give a global sort, but my result with numReduceTasks=1 (one reducer) is not sorted.

For example, my input to the mapper is:

file 1: A long time ago in a galaxy far far away

file 2: Another episode for Star Wars

Result is:

    A 1
    a 1
    Star 1
    ago 1
    for 1
    far 2
    away 1
    time 1
    Wars 1
    long 1
    Another 1
    in 1
    episode 1
    galaxy 1

But this is not a global sort!

So what does "sort" mean in "shuffle and sort", and what does "global sort" mean?

Mapper code:

    #!/usr/bin/env python
    import sys

    # Emit one "word<TAB>1" pair for every word read from stdin.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

Reducer code:

    #!/usr/bin/env python
    import sys

    # Running total for each word seen on stdin.
    word2count = {}

    for line in sys.stdin:
        line = line.strip()
        try:
            # Each input line is "word<TAB>count"; skip malformed lines.
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            continue
        word2count[word] = word2count.get(word, 0) + count

    # Emit the totals in the dict's iteration order.
    for word in word2count:
        print('%s\t%s' % (word, word2count[word]))
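One possible explanation, offered as a sketch rather than a confirmed diagnosis: the reducer above collects counts in a dict and prints them in the dict's iteration order, which in Python 2 is arbitrary and need not match the order in which the keys arrived. Assuming Hadoop streaming hands a single reducer its lines already sorted by key, a reducer that emits each word as soon as its run of lines ends would keep that order. The names `parse` and `reduce_sorted` are illustrative, and the sample lines under the `__main__` guard stand in for `sys.stdin`.

```python
#!/usr/bin/env python
"""Sketch of a streaming reducer that keeps the key order it receives."""
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not an integer


def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # Demo on already-sorted sample lines; in a streaming job, pass sys.stdin.
    sample = ['A\t1\n', 'far\t1\n', 'far\t1\n', 'galaxy\t1\n']
    for word, total in reduce_sorted(sample):
        print('%s\t%d' % (word, total))
```

Because nothing is buffered in a dict, the output order is exactly the input order, so any ordering provided upstream is preserved. If the input were not grouped by key, the same word could be emitted more than once.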

I use this command to run the job:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input \
    -output /user/cloudera/output_new_0 \
    -mapper /home/cloudera/wordcount_mapper.py \
    -reducer /home/cloudera/wordcount_reducer.py \
    -numReduceTasks=1
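To make the "sort" in "shuffle and sort" concrete, here is a small local sketch of the map and sort steps applied to the two sample files. It assumes the job uses the default key comparison, which orders text keys by their raw bytes; `map_words` and `sort_by_key_bytes` are illustrative names and not part of Hadoop.

```python
"""Local sketch of the map and sort steps for the two sample files."""


def map_words(lines):
    """Mimic the mapper: yield one (word, 1) pair per whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def sort_by_key_bytes(pairs):
    """Order pairs by the UTF-8 bytes of the key (the assumed comparator)."""
    return sorted(pairs, key=lambda pair: pair[0].encode('utf-8'))


if __name__ == '__main__':
    sample_lines = [
        'A long time ago in a galaxy far far away',
        'Another episode for Star Wars',
    ]
    # With one reducer, every pair lands in the same sorted stream.
    for word, count in sort_by_key_bytes(map_words(sample_lines)):
        print('%s\t%d' % (word, count))
```

Under this assumption, capitalized words come before lowercase ones, because uppercase ASCII letters have smaller byte values, and the two `far` pairs end up next to each other. With more than one reducer, each reducer would see its own sorted stream, so the concatenated outputs would be sorted only if the partitioner also kept key ranges in order.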
