Question:
I've noticed that when I run Hadoop with Python code, either the mapper or the reducer (I'm not sure which) sorts my output before reducer.py prints it. It currently appears to be sorted alphanumerically. Is there a way to disable this completely? I would like the program's output to follow the order in which mapper.py prints it. I've found answers for Java but none for Python. Would I need to modify mapper.py, or perhaps the command-line arguments?
Answer 1:
You should read more about basic MapReduce concepts. Even though the sorting may be unnecessary in some cases, the shuffle part of the "Shuffle & Sort" phase is an intrinsic part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the mappers' output so that all values sharing a key are sent to the same reducer, which is what lets the reducer actually "reduce" the data. When using streaming, the key and the value are, by default, separated by a tab character. From your sample code in other SO questions, I can see that you are not producing "key, value" tuples, but rather just single text lines. Streaming then treats each whole line as the key (with an empty value), which is why your lines come out sorted.
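To illustrate the "key, value" point, here is a minimal sketch of a streaming mapper that writes tab-separated records. Using the first whitespace-delimited token as the key is purely an assumption for the example, and the names `to_key_value` and `main` are hypothetical; pick whatever key makes sense for your data.

```python
import sys


def to_key_value(line, sep="\t"):
    """Turn one input line into a 'key<sep>value' record.

    Assumption for this sketch: the first whitespace-delimited token is
    the key and the rest of the line is the value.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # skip blank lines
    key = parts[0]
    value = parts[1] if len(parts) > 1 else ""
    return f"{key}{sep}{value}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds input lines on stdin and reads records from stdout.
    for line in stdin:
        record = to_key_value(line)
        if record is not None:
            stdout.write(record + "\n")

# In an actual mapper.py, finish the script with a call to main().
```

Hadoop will then sort and group on the part before the tab, not on the whole line.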
EDIT: Added the following answer to the question "How to make it sort numerically (e.g., 9 before 10)?"
Alternative 1: Prepend zeroes to your keys so that they all have the same size. "09" comes before "10".
Alternative 2: Use the KeyFieldBasedComparator, as indicated in this SO question.
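As a rough sketch of Alternative 1, zero-padding in Python can be done with `str.zfill`. The helper name `pad_key` and the width of 5 are assumptions for illustration; the width has to be at least as large as your longest key, and this only works as shown for non-negative integers.

```python
def pad_key(n, width=5):
    """Left-pad a non-negative integer key with zeroes to a fixed width."""
    return str(n).zfill(width)


keys = [10, 9, 100, 1]

# Plain text sort, which is what Hadoop does by default: "10" sorts before "9".
plain = sorted(str(k) for k in keys)

# With fixed-width keys, text order and numeric order agree.
padded = sorted(pad_key(k) for k in keys)

print(plain)   # ['1', '10', '100', '9']
print(padded)  # ['00001', '00009', '00010', '00100']
```

The reducer can strip the padding again with `int(key)` if the original number is needed in the output.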
Answer 2:
No, as stated here:
If your number of reduce tasks is not 0, the hadoop framework will sort your results. There is no way around it.
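Since the sort cannot be switched off when reducers are used, one possible workaround is to make the sort reproduce the mapper's order: have the mapper emit a zero-padded sequence number as the key, and have the reducer drop that key again. The sketch below is only an illustration with hypothetical helper names (`tag_lines`, `untag`). It assumes a single mapper and a single reducer; with several mappers the sequence numbers would collide, so the key would need something extra to tell the mappers apart.

```python
def tag_lines(lines, width=10):
    """Mapper side: prefix each line with a fixed-width sequence number key."""
    for i, line in enumerate(lines):
        yield f"{i:0{width}d}\t{line.rstrip(chr(10))}"


def untag(records):
    """Reducer side: discard the sequence key and keep only the value."""
    for record in records:
        _key, _sep, value = record.rstrip("\n").partition("\t")
        yield value


mapper_output = ["zebra\n", "apple\n", "mango\n"]
tagged = list(tag_lines(mapper_output))

# sorted() stands in for Hadoop's text sort on the key during shuffle.
after_shuffle = sorted(tagged)

print(list(untag(after_shuffle)))  # ['zebra', 'apple', 'mango']
```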