I am running a Hadoop program and have the following as my input file, input.txt:
1
2
mapper.py:

#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
print "Test"
reducer.py:

#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
When I run it without Hadoop: $ cat ./input.txt | ./mapper.py | ./reducer.py, the output is as expected:
1
2
Test
However, when I run it through Hadoop via the streaming API (as described here), the trailing "Test" line is duplicated:
1
2
Test
Test
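For what it's worth, I can reproduce the duplicated "Test" locally if I assume Hadoop launched two map tasks (e.g. one per input split), since each mapper process prints its own trailing "Test". This is just a hypothetical simulation; `map` below stands in for mapper.py's behavior:

```shell
# Stand-in for mapper.py: echo the input, then print a trailing "Test".
map() { cat; echo Test; }

# Run the "mapper" once per split, then sort, as the shuffle phase would.
{ echo 1 | map; echo 2 | map; } | LC_ALL=C sort
# Prints: 1 2 Test Test  (one "Test" per simulated map task)
```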
Additionally, when I run the program through Hadoop, it fails roughly one time in four with:
Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I've looked at this for some time and can't figure out what I'm not getting. If anyone could help with these issues, I would greatly appreciate it! Thanks.
edit: When input.txt is:
1
2
3
4
5
6
7
8
9
10
The output is:
1
10
2
3
4
5
6
7
8
9
Test
Test
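If I understand correctly, the ordering above ("10" before "2") is because the shuffle phase sorts the mapper's output keys as byte strings by default, not numerically. A plain byte-wise sort reproduces it:

```shell
# Byte-wise (lexicographic) sort puts "10" before "2", matching the
# ordering Hadoop's shuffle phase produced above.
seq 1 10 | LC_ALL=C sort
# Prints: 1 10 2 3 4 5 6 7 8 9
```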