Several places say the default # of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks symbol to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
The default of 1 maybe for a vanilla Hadoop install. Hive overrides it.
In open source hive (and EMR likely)
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)
This post says default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max
.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks
, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to 1 , irrespective of the size of input data. These are called 'full aggregates' - and if the only thing that the query does is full aggregates - then the compiler knows that the data from the mappers is going to be reduced to trivial amount and there's no point running multiple reducers.