Hadoop: Output file has double output

Posted 2019-08-29 07:52

Question:

I am running a Hadoop program and have the following as my input file, input.txt:

1
2

mapper.py:

#!/usr/bin/env python
# Python 2 mapper: echo every stdin line, then emit "Test" once at the end.
import sys

for line in sys.stdin:
    print line,   # trailing comma keeps print from adding a second newline
print "Test"

reducer.py:

#!/usr/bin/env python
# Python 2 reducer: identity pass-through of whatever the mappers emit.
import sys

for line in sys.stdin:
    print line,

When I run it without Hadoop ($ cat ./input.txt | ./mapper.py | ./reducer.py), the output is as expected:

1
2
Test

However, when I run it through Hadoop via the streaming API (as described here), the latter part of the output comes out "doubled":

1
2
Test    
Test
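
For reference, a streaming invocation in the style of that tutorial looks roughly like the following; the streaming jar location and the input/output paths are placeholders and vary by installation:

$ hadoop jar /path/to/hadoop-streaming.jar \
    -input input.txt \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py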

Additionally, when I run the program through Hadoop, it seems to have about a one-in-four chance of failing with this error:

Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I've looked at this for some time and can't figure out what I'm not getting. If anyone could help with these issues, I would greatly appreciate it! Thanks.

Edit: When input.txt is:

1
2
3
4
5
6
7
8
9
10

The output is:

1   
10  
2   
3   
4   
5   
6   
7   
8   
9   
Test    
Test

Answer 1:

That would produce the same output. My guess is that you are passing mapper.py as the reducer as well, so its final print "Test" runs a second time in the reduce phase. Make sure you are providing the correct path to reducer.py in the -reducer option.
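
One way to sanity-check that guess locally is to pipe the mapper into itself, which simulates the reducer accidentally being mapper.py:

$ cat ./input.txt | ./mapper.py | ./mapper.py
1
2
Test
Test

This reproduces the doubled "Test" from the question, whereas piping into reducer.py does not.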