What is the priority order of the following three options for setting the number of reducers? In other words, if all three are set, which one takes effect?
Option1:
setNumReduceTasks(2) within the application code
Option2:
-D mapreduce.job.reduces=2 as command line argument
Option3:
through $HADOOP_CONF_DIR/mapred-site.xml file
<property>
<name>mapreduce.job.reduces</name>
<value>2</value>
</property>
You have them ranked in priority order: option 1 overrides option 2, and option 2 overrides option 3. In other words, Option 1 is the one your job will use in this scenario.
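To make the ordering concrete, here is a minimal sketch in plain Java that only simulates the layering. It does not use the Hadoop API: the class ReducerPriorityDemo and the method effectiveReduces are hypothetical names made up for illustration. It assumes the usual driver pattern, where the configuration files are loaded first, the -D options are applied next, and setNumReduceTasks is called after that, so the last write wins. It also handles only the space-separated "-D key=value" form.

```java
import java.util.HashMap;
import java.util.Map;

public class ReducerPriorityDemo {
    static final String KEY = "mapreduce.job.reduces";

    // Simulates the three layers (illustration only, not Hadoop's own code).
    // siteFileValue: value as it would appear in mapred-site.xml, or null if absent
    // args:          command-line arguments, possibly containing "-D key=value"
    // setInCode:     value passed to setNumReduceTasks, or null if never called
    static int effectiveReduces(String siteFileValue, String[] args, Integer setInCode) {
        Map<String, String> conf = new HashMap<>();

        // Built-in default: one reducer.
        conf.put(KEY, "1");

        // Layer 1: the configuration file overrides the built-in default.
        if (siteFileValue != null) {
            conf.put(KEY, siteFileValue);
        }

        // Layer 2: each "-D key=value" pair overrides the file value.
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("-D")) {
                String[] kv = args[i + 1].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
                i++; // skip the key=value token just consumed
            }
        }

        // Layer 3: an explicit call in the code is applied last, so it wins.
        if (setInCode != null) {
            conf.put(KEY, String.valueOf(setInCode));
        }

        return Integer.parseInt(conf.get(KEY));
    }

    public static void main(String[] unused) {
        String[] cli = {"-D", "mapreduce.job.reduces=3"};
        System.out.println(effectiveReduces("5", cli, 2));               // all three set
        System.out.println(effectiveReduces("5", cli, null));            // file and -D only
        System.out.println(effectiveReduces("5", new String[0], null));  // file only
    }
}
```

With all three set, the value from the code is used. Drop the setter call and the -D value is used, and without either the value from the file is used. If you want -D to stay effective, do not hard-code setNumReduceTasks in the driver.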
According to Hadoop: The Definitive Guide:
The -D option is used to set the configuration property with key color to the value
yellow. Options specified with -D take priority over properties from the configuration
files. This is very useful because you can put defaults into configuration files and then
override them with the -D option as needed. A common example of this is setting the
number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will
override the number of reducers set on the cluster or set in any client-side configuration
files.
First Priority: Passing configuration parameters through command line (while submitting MR Application)
Second Priority: Setting configuration parameters in application code
Third Priority: Default parameters read from the XML configuration files, such as core-site.xml, hdfs-site.xml and mapred-site.xml