For a python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter being passed in?
I understand that streaming jobs are called in the format of:
hadoop jar hadoop-streaming.jar -input <input_dir> -output <output_dir> -mapper mapper.py -reducer reducer.py ...
I want to affect reducer.py.
The argument to the command-line option -reducer can be any command, so you can try:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input inputDirs \
-output outputDir \
-mapper myMapper.py \
-reducer 'myReducer.py 1 2 3' \
-file myMapper.py \
-file myReducer.py
assuming myReducer.py is made executable. Disclaimer: I have not tried this exact command, but I have passed similarly complex strings to -mapper and -reducer before.
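For example, a minimal sketch of how myReducer.py could pick up those extra arguments via sys.argv (the threshold logic here is purely hypothetical, just to show the arguments changing the reducer's behavior):

#!/usr/bin/env python
# myReducer.py -- sketch: read the extra arguments ('1 2 3' above)
# from sys.argv and use them to change the reducer's behavior.
import sys

def main():
    args = sys.argv[1:]                  # ['1', '2', '3'] for the command above
    threshold = int(args[0]) if args else 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        # Hypothetical behavior: only emit values at or above the threshold.
        if int(value) >= threshold:
            print("%s\t%s" % (key, value))

if __name__ == "__main__":
    main()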
That said, have you tried the -cmdenv name=value option, and just having your Python reducer get its value from the environment? It's just another way to do things.
In your Python code,
import os
(...)
os.environ["PARAM_OPT"]
In your Hadoop command include:
hadoop jar \
(...)
-cmdenv PARAM_OPT=value \
(...)
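For completeness, a hypothetical reducer.py reading PARAM_OPT from the environment might look like this (a sketch only; the "upper" behavior is made up for illustration):

#!/usr/bin/env python
# Sketch of a streaming reducer that reads its parameter from the
# PARAM_OPT environment variable set with -cmdenv PARAM_OPT=value.
import os
import sys

def main():
    param = os.environ.get("PARAM_OPT", "default")
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if param == "upper":             # hypothetical behavior switch
            key = key.upper()
        print("%s\t%s" % (key, value))

if __name__ == "__main__":
    main()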
If you are using Python, you may want to check out dumbo, which provides a nice wrapper around Hadoop streaming.
In dumbo you pass parameters with -param, as in:
dumbo start yourpython.py -hadoop <hadoop-path> -input <input> -output <output> -param <parameter>=<value>
And then read it in the reducer:
class Reducer:
    def __init__(self):
        # dumbo exposes values passed with -param through self.params
        self.parameter = int(self.params["<parameter>"])

    def __call__(self, key, values):
        # do something interesting with self.parameter ...
        yield key, sum(values)  # e.g. aggregate the values
You can read more in the dumbo tutorial
You can pass arguments to -reducer as in the command below:
hadoop jar hadoop-streaming.jar \
-mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
-reducer 'count_reducer.py arg3' -file count_reducer.py \
You can revise this as needed.