I am learning Elastic MapReduce and started with the Word Splitter example from the Amazon tutorial section (code shown below). The example produces a word count across all the input documents provided.
But I want word counts by file name, i.e., the count of a word in one particular document. Since the Python word count code reads its input from stdin, how do I tell which input line came from which document?
Thanks.
#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
In the typical WordCount example, the name of the file the mapper is processing is ignored, since the job output contains the consolidated word count for all the input files rather than a count per file. To get the word count at the file level, the input file name has to be used. With Hadoop Streaming, a Python mapper can read it from the task environment: the job property map.input.file is exported with its dots replaced by underscores, so the lookup is os.environ["map_input_file"] (mapreduce_map_input_file on newer Hadoop releases). The Hadoop MapReduce tutorial lists the task execution environment variables.
Instead of emitting just the key/value pair <Hello, 1>, the mapper should also emit the name of the input file being processed. For example, the map could emit <input.txt, <Hello, 1>>, where input.txt is the key and <Hello, 1> is the value.
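A minimal sketch of such a mapper, assuming Hadoop Streaming and a tab-separated record layout of file name, word, count (my choice, not something the tutorial prescribes). The helper names input_file_name and map_lines are made up for illustration, and the "unknown" fallback is only there so the script can also be run locally outside Hadoop.

```python
#!/usr/bin/python
import os
import re
import sys

WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def input_file_name(environ=os.environ):
    # Hadoop Streaming exports job properties as environment variables
    # with dots replaced by underscores. Try the newer name first, then
    # the older one; "unknown" is an assumed fallback for local runs.
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")
    # Keep only the last path component, e.g. "input.txt".
    return os.path.basename(path)


def map_lines(lines, file_name):
    # Emit one "file<TAB>word<TAB>1" record per word occurrence.
    # Streaming treats the text before the first tab as the key,
    # so the file name becomes the key and "word<TAB>1" the value.
    for line in lines:
        for word in WORD.findall(line):
            yield "%s\t%s\t1" % (file_name, word.lower())


def main():
    name = input_file_name()
    for record in map_lines(sys.stdin, name):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

Note that this drops the LongValueSum: prefix from the tutorial code, so it is meant to be paired with your own reducer rather than the built-in aggregate reducer.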
Because the file name is the key, all the word counts for a particular file will be processed by a single reducer. The reducer must then aggregate the word counts for that file.
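A rough reducer sketch to go with that, assuming the input arrives as tab-separated lines of file name, word, count, grouped by file name (the record layout and the function names reduce_lines and format_counts are my own assumptions). Since the words within one file are not necessarily sorted, it keeps a per-file dictionary and writes it out when the file name changes.

```python
#!/usr/bin/python
import sys


def format_counts(file_name, counts):
    # One "file<TAB>word<TAB>total" record per word, in sorted word order.
    return ["%s\t%s\t%d" % (file_name, word, counts[word])
            for word in sorted(counts)]


def reduce_lines(lines):
    current_file = None
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records
        file_name, word, count = parts
        if file_name != current_file:
            # A new key started: emit the totals of the previous file.
            for record in format_counts(current_file, counts):
                yield record
            current_file, counts = file_name, {}
        counts[word] = counts.get(word, 0) + int(count)
    # Emit the totals of the last file.
    for record in format_counts(current_file, counts):
        yield record


def main():
    for record in reduce_lines(sys.stdin):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

This holds the distinct words of one file in memory at a time, which should be fine for ordinary documents. Because it reads and writes the same record layout and only sums counts, the same script could in principle be tried as the combiner too.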
As usual, a Combiner would help to decrease the network chatter between the mapper and the reducer and also to complete the job faster.
Check Data-Intensive Text Processing with MapReduce for more algorithms on text processing.