MapReduce job output sort order

Posted 2019-01-19 17:28

I can see in my MapReduce jobs that the output of the reducer is sorted by key.

So if I set the number of reducers to 10, the output directory contains 10 files, and each of those files holds sorted data.

The reason I am asking is that although each file is sorted internally, the files are not sorted relative to one another. For example, there are cases where every part-000* file starts at 0 and ends at zzzz, assuming I am using Text as the key.

I was assuming the ordering would also hold across files, i.e. the first file should hold the keys starting with a, and the last file, part-00009, should hold entries like zzzz, or at least keys greater than a,

assuming my keys are uniformly distributed over the alphabet.

Could someone shed some light on why this happens?

4 Answers
别忘想泡老子
#2 · 2019-01-19 18:04

Hive's ORDER BY uses a single reducer, so you can use DISTRIBUTE BY / SORT BY instead, and then run an INSERT OVERWRITE LOCAL DIRECTORY query over the sorted table to write the data into a file.

家丑人穷心不美
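#3 · 2019-01-19 18:09
Q: All the files have sorted data, but the files themselves are not sorted.

A: HashPartitioner is used by default to partition the intermediate output (from the mapper). It assigns each key to a reducer by hash code, not by key range, so every reducer receives keys from the whole key space.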

Example:

If the intermediate keys are 3, 4, 5, 6, 7, 8, 9, 10, 11,
then the data will be partitioned across the reducers as (let's say):
R1 {7, 4, 10}
R2 {5, 11, 8}
R3 {9, 6, 3}

So the output files will contain:

part-00000 {4, 7, 10}
part-00001 {5, 8, 11}
part-00002 {3, 6, 9}
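
To make the effect concrete, here is a small plain-Python simulation (not Hadoop code). It assumes integer keys whose hash is the value itself, which is how IntWritable behaves, and applies the same formula HashPartitioner uses. The function names are made up for this sketch, and the part numbering it produces differs from the "let's say" assignment above.

```python
def partition(key, num_reducers):
    # Same formula as HashPartitioner: (hash & Integer.MAX_VALUE) % numReduceTasks.
    # Assumption: the key is a small non-negative int and its hash is the value itself.
    return (key & 0x7FFFFFFF) % num_reducers


def run_job(keys, num_reducers):
    # One bucket per reducer; each reducer sees only its own keys.
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        parts[partition(k, num_reducers)].append(k)
    # The shuffle sorts keys within each reducer, never across reducers.
    return [sorted(p) for p in parts]


parts = run_job([3, 4, 5, 6, 7, 8, 9, 10, 11], 3)
for i, p in enumerate(parts):
    print(f"part-{i:05d} {p}")
# part-00000 [3, 6, 9]
# part-00001 [4, 7, 10]
# part-00002 [5, 8, 11]
```

Every file is sorted, but reading the files one after another gives 3, 6, 9, 4, 7, 10, ..., which is not a global order.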

If you are looking for sort-by-value, here is the answer.

家丑人穷心不美
#4 · 2019-01-19 18:20

You can achieve a globally sorted file (which is what you basically want) using these methods:

  1. Use just one reducer in MapReduce (bad idea! This puts too much work on one machine).
  2. Write a custom partitioner. The partitioner is the class that divides the key space in MapReduce. The default partitioner (HashPartitioner) spreads keys across the reducers by hash code, which balances the load but ignores key order. Check out this example for writing a custom partitioner.
  3. Use Pig or Hive on Hadoop to do the sort.
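
As a rough illustration of option 2, the sketch below simulates a range partitioner in plain Python rather than against the Hadoop API. The function `range_partition` is hypothetical, and it assumes keys are lowercase ASCII words so the first letter can be mapped onto a reducer index.

```python
def range_partition(key, num_reducers):
    # Hypothetical range partitioner: lower first letters go to lower reducers.
    # Assumption: keys are lowercase ASCII words; other first characters are
    # clamped to the 'a' or 'z' bucket so the mapping stays monotonic.
    if not key:
        return 0
    idx = min(max(ord(key[0]) - ord("a"), 0), 25)
    return idx * num_reducers // 26


def run_job(keys, num_reducers):
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        parts[range_partition(k, num_reducers)].append(k)
    # Each reducer still sorts only its own keys.
    return [sorted(p) for p in parts]


keys = ["zebra", "apple", "mango", "kiwi", "banana", "tomato", "grape", "yam"]
parts = run_job(keys, 3)
flat = [k for p in parts for k in p]
print(parts)
# [['apple', 'banana', 'grape'], ['kiwi', 'mango'], ['tomato', 'yam', 'zebra']]
print(flat == sorted(keys))
# True
```

Because the partition index never decreases as the key grows, concatenating the part files in order gives a globally sorted result. The catch is skew: if most keys start with the same few letters, a fixed mapping like this overloads some reducers.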

Emotional °昔
#5 · 2019-01-19 18:21

Total Sort

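All key-value pairs for a particular key reach the same reducer. This is decided by the partitioner on the map side. Combiners on the map side act as semi-reducers, pre-aggregating the values of a key before they are sent to the reducer. HashPartitioner is the default partitioner; it decides which reducer each key goes to, not how many reducers there are, and it does not preserve key order across reducers.

With a single reducer, the output is a single file with all records sorted by key. With several reducers, a total sort needs an order-preserving partitioner such as TotalOrderPartitioner, so that the part files are sorted when read in sequence.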
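
Hadoop's TotalOrderPartitioner works from a list of split points, usually produced by sampling the input. The plain-Python sketch below imitates that idea only; it is not the Hadoop implementation, and `choose_split_points`, `partition` and `total_sort` are names invented for this example.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    # Pick num_reducers - 1 boundary keys at evenly spaced ranks of the sorted sample.
    s = sorted(sample)
    if not s:
        return []
    return [s[len(s) * i // num_reducers] for i in range(1, num_reducers)]


def partition(key, split_points):
    # Keys below the first split point go to reducer 0, the next range to reducer 1, ...
    return bisect.bisect_right(split_points, key)


def total_sort(keys, num_reducers, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = choose_split_points(sample, num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        parts[partition(k, splits)].append(k)
    # Each reducer sorts its own range; ranges are already ordered by construction.
    return [sorted(p) for p in parts]


rng = random.Random(42)
letters = "abcdefghijklmnopqrstuvwxyz"
keys = ["".join(rng.choice(letters) for _ in range(4)) for _ in range(1000)]
parts = total_sort(keys, 10)
flat = [k for p in parts for k in p]
print([len(p) for p in parts])  # roughly balanced, depends on the sample
print(flat == sorted(keys))     # True
```

Sampling is what keeps the reducers roughly balanced even when the keys are not uniformly distributed, which a fixed first-letter mapping cannot guarantee.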

Secondary Sort

Used to define how map output keys are sorted. It applies to the map output, before the reducer sees it. With a secondary sort we can control the ordering of the values as well as the keys. That is, sorting can be done on two or more fields.
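
In Hadoop this is usually done with a composite key (natural key plus the value field), a sort comparator over the whole composite key, and a grouping comparator over the natural key only. The sketch below mimics those two steps in plain Python; `secondary_sort` and the sample records are made up for illustration.

```python
from itertools import groupby


def secondary_sort(records):
    # records: iterable of (key, value) pairs.
    # Step 1: sort on the composite (key, value), like a sort comparator.
    ordered = sorted(records, key=lambda kv: (kv[0], kv[1]))
    # Step 2: group on the natural key only, like a grouping comparator,
    # so each reduce call sees its values already in order.
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=lambda kv: kv[0])]


records = [("2019", 30), ("2018", 12), ("2019", 5), ("2018", 40), ("2019", 17)]
result = secondary_sort(records)
print(result)
# [('2018', [12, 40]), ('2019', [5, 17, 30])]
```

The reducer never has to buffer and sort the values itself, which is the main reason to push this ordering into the shuffle.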

Have a look at Total order sorting & Secondary sorting
