I've noticed that when running Hadoop with Python code, either the mapper or the reducer (I'm not sure which) sorts my output before reducer.py prints it. Currently it seems to be sorted alphanumerically. Is there a way to disable this completely? I would like the program's output to follow the order in which mapper.py prints it. I've found answers for Java but none for Python. Would I need to modify mapper.py, or perhaps the command-line arguments?
No, as stated here: You should read more on basic MapReduce concepts. Even though the sorting may be unnecessary in some cases, the shuffling part of the "Shuffle & Sort" phase is an intrinsic part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the output of the mappers by key so that all records with the same key go to one single reducer, which can then actually "reduce" the data. When using streaming, the key and value are separated by a tab character by default. From your sample code in other SO questions, I can see that you are not producing "key, value" tuples, but rather just single text lines.
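To make the "key, value" point concrete, here is a minimal sketch of what a streaming mapper could emit. The helper name `to_key_value` and the choice of a zero-padded line counter as the key are assumptions for illustration, not something from the original code. Because the counter restarts in every mapper task, this would only keep the emit order within a single mapper.

```python
def to_key_value(lines, width=10):
    """Yield 'key<TAB>value' records for Hadoop streaming.

    The key is a zero-padded line counter (a hypothetical choice), so
    the default text sort keeps records in the order they were read.
    """
    for index, line in enumerate(lines):
        key = str(index).zfill(width)
        # Streaming treats everything before the first tab as the key.
        yield key + "\t" + line.rstrip("\n")


# In mapper.py this would typically be driven by standard input:
#     import sys
#     for record in to_key_value(sys.stdin):
#         print(record)
```

The reducer would then split each line on the first tab and drop the counter before printing the value.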
EDIT: Added the following answer to the question "How to make it sort numerically (e.g., 9 before 10)?"
Alternative 1: Prepend zeroes to your keys so that they all have the same size. "09" comes before "10".
Alternative 2: Use the KeyFieldBasedComparator, as indicated in this SO question.
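As a rough illustration of Alternative 1, the sketch below compares how string keys sort with and without zero padding. The helper name `pad_key` and the fixed width of 2 are assumptions; in practice the width would need to cover your largest key.

```python
def pad_key(key, width=2):
    """Left-pad a numeric key with zeroes so text order matches numeric order."""
    return str(key).zfill(width)


keys = [10, 9, 1]

# Plain string keys sort character by character, so "10" lands before "9".
unpadded = sorted(str(k) for k in keys)

# Padded keys all have the same length, so text order equals numeric order.
padded = sorted(pad_key(k) for k in keys)

print(unpadded)  # ['1', '10', '9']
print(padded)    # ['01', '09', '10']
```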