I'm running a fairly big MRJob job (1,755,638 keys) and the keys are being written to the reducers in sorted order. This happens even if I specify that Hadoop should use the hash partitioner, with:
class SubClass(MRJob):
    PARTITIONER = "org.apache.hadoop.mapred.lib.HashPartitioner"
    ...
I don't understand why the keys are sorted, when I am not asking for them to be sorted.
The HashPartitioner is already used by default when you don't specify a partitioner explicitly, so setting it changes nothing.
The output is not globally sorted by default, but it can appear that way when the dataset is small: a small job typically runs a single reducer, and each reducer receives its keys in sorted order, so the single output file looks fully sorted. When I increased the size of the dataset from 50 MB to 10 GB, the job used multiple reducers and the keys stopped appearing in sorted order.
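A toy sketch of why this happens (hypothetical keys and a stand-in hash function, not the real job): each key is assigned to a reducer by hashing, each reducer sorts its own partition, but concatenating the reducers' outputs does not give a globally sorted result.

```python
# Hypothetical keys, just to illustrate partitioning behavior.
keys = ["grape", "apple", "mango", "banana", "cherry", "fig"]
num_reducers = 3


def toy_hash(key):
    # Deterministic stand-in for Java's String.hashCode();
    # the real HashPartitioner uses hashCode() % numReduceTasks.
    return ord(key[0])


# Assign each key to a partition, like HashPartitioner does.
partitions = {i: [] for i in range(num_reducers)}
for k in keys:
    partitions[toy_hash(k) % num_reducers].append(k)

# Each reducer sorts the keys it receives...
per_reducer = [sorted(partitions[i]) for i in range(num_reducers)]

# ...but the concatenation of all reducer outputs is not
# globally sorted.
combined = [k for part in per_reducer for k in part]
print(combined)
print(combined == sorted(keys))
```

With one reducer every key lands in the same partition, so the single sorted partition *is* the globally sorted output, which is why small jobs look sorted.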
MapReduce sorts the key/value pairs by key during the shuffle phase so it can guarantee that all values for a given key reach the reducer together. In fact, the iterable passed into the reduce() method just reads that sorted stream until it encounters a new key, and then it stops iterating. That's why, within any single reducer, the keys will always appear in sorted order.
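As an illustration (not MRJob's actual internals), this "read the sorted stream until the key changes" grouping can be sketched with itertools.groupby:

```python
from itertools import groupby
from operator import itemgetter

# Simulated mapper output after the shuffle/sort phase:
# pairs arrive at the reducer already sorted by key.
sorted_pairs = [
    ("apple", 1), ("apple", 3),
    ("banana", 2),
    ("cherry", 5), ("cherry", 1), ("cherry", 4),
]

# groupby walks the sorted list and yields each key with an
# iterator over its consecutive values; the iterator stops as
# soon as it sees a new key, just like the reducer's iterable.
for key, values in groupby(sorted_pairs, key=itemgetter(0)):
    total = sum(v for _, v in values)
    print(key, total)
```

Because the grouping relies on equal keys being adjacent, the sort is a prerequisite for correct reducing, not an extra feature you opted into.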