Hadoop and Python: Disable Sorting

Posted 2019-02-25 19:33

I've noticed that when running Hadoop with Python code, either the mapper or the reducer (I'm not sure which) sorts my output before reducer.py prints it. Currently it seems to be sorted alphanumerically. Is there a way to disable this completely? I would like the program's output to follow the order in which mapper.py prints it. I've found answers for Java but none for Python. Would I need to modify mapper.py, or perhaps the command-line arguments?

Tags: python sorting hadoop mapreduce cluster-computing

2 Answers

乱世女痞

#2 · 2019-02-25 19:38

No, as stated here:

If your number of reduce tasks is not 0, the hadoop framework will sort your results. There is no way around it.
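The flip side of that statement is that a job with zero reduce tasks skips the shuffle and sort entirely. A minimal sketch of how such a map-only streaming job could be launched is below. The jar path, input and output paths, and the helper name `build_map_only_command` are assumptions for illustration; `mapreduce.job.reduces` is the standard property for the number of reduce tasks.

```python
def build_map_only_command(streaming_jar, input_path, output_path,
                           mapper="mapper.py"):
    """Build the argv list for a map-only Hadoop Streaming job.

    All paths are placeholders (assumptions); adjust them to your cluster.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options (-D, -files) must come before streaming options.
        "-D", "mapreduce.job.reduces=0",  # zero reducers: no shuffle, no sort
        "-files", mapper,                 # ship the script to the task nodes
        "-input", input_path,
        "-output", output_path,
        "-mapper", f"python3 {mapper}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    cmd = build_map_only_command(
        "/path/to/hadoop-streaming.jar",  # hypothetical jar location
        "/user/me/input",                 # hypothetical HDFS input path
        "/user/me/output",                # hypothetical HDFS output path
    )
    print(" ".join(cmd))
```

Note that in a map-only job reducer.py never runs, and each map task writes its own output file, so the original order is only preserved within each task's output.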


仙女界的扛把子

#3 · 2019-02-25 19:45

You should read more on basic MapReduce concepts. Even though the sorting may be unnecessary in some cases, the shuffling part of the "Shuffle & Sort" phase is an intrinsic part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the mappers' output so that all values for a given key are sent to a single reducer; only then can the reducer actually "reduce" the data. When using streaming, key/value pairs are separated by a tab character by default. From your sample code in other SO questions, I can see that you are not producing "key, value" tuples, but rather just single text lines.
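To make the grouping step concrete, here is a small local simulation of the map, shuffle-and-sort, and reduce stages. It is only a sketch: the word-count logic and the function names are illustrative assumptions, and Python's `sorted` stands in for Hadoop's shuffle (which compares raw bytes by default, the same order for plain ASCII keys).

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "key<TAB>value" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def shuffle_and_sort(records):
    """Stand-in for Hadoop's shuffle: split at the first tab, sort by key."""
    pairs = [tuple(record.split("\t", 1)) for record in records]
    return sorted(pairs, key=lambda kv: kv[0])


def reducer(sorted_pairs):
    """Sum the values of each key; relies on equal keys being adjacent."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


if __name__ == "__main__":
    for key, total in reducer(shuffle_and_sort(mapper(["b a", "a c"]))):
        print(f"{key}\t{total}")
```

The reducer only works because the sort places identical keys next to each other, which is why the framework cannot simply skip this phase when reducers are present.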

EDIT: Added the following answer to the question "How to make it sort numerically (e.g., 9 before 10)?"

Alternative 1: Pad your keys with leading zeroes so that they all have the same length. As a string, "09" sorts before "10".

Alternative 2: Use the KeyFieldBasedComparator, as indicated in this SO question.
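Alternative 1 can be sketched in a few lines of Python. The helper names `pad_key` and `number_lines` and the default widths are assumptions for illustration, and the padding only works for non-negative integers that fit in the chosen width.

```python
def pad_key(n, width=5):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    return str(n).zfill(width)


def number_lines(lines, width=8):
    """Prefix each line with a zero-padded sequence number as its key.

    A mapper emitting these records would have its output order restored
    by the framework's sort (within a single mapper's input).
    """
    for index, line in enumerate(lines):
        yield f"{pad_key(index, width)}\t{line}"


if __name__ == "__main__":
    keys = [10, 9, 100, 1]
    print(sorted(str(k) for k in keys))    # plain strings: lexicographic order
    print(sorted(pad_key(k) for k in keys))  # padded strings: numeric order
```

With several map tasks, the sequence numbers from `number_lines` would collide, so a real job would need a key that is unique across mappers (for example, one that also encodes the input offset).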
