I am working on a map-reduce job consisting of multiple steps. With mrjob, every step receives the previous step's output. The problem is that I don't want it to.
What I want is to extract some information and use it in the second step against the entire input, and so on. Is it possible to do this using mrjob?
Note: Since I don't want to use EMR, this question is not of much help to me.
UPDATE: If it is not possible to do this in a single job, I will need to do it in two separate jobs. In that case, is there any way to wrap these two jobs and manage the intermediate outputs, etc.?
You can use runners.
You will have to define the jobs separately and use another Python script to invoke them.
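import sys

from NumLines import NumLines
from WordsPerLine import WordsPerLine

# Output directory of the first job, shared with the second job.
intermediate = None


def firstJob(input_file):
    global intermediate
    mr_job = NumLines(args=[input_file])
    with mr_job.make_runner() as runner:
        runner.run()
        # Remember where the first job wrote its output. Depending on mrjob's
        # cleanup settings, this directory may be removed once the runner exits.
        intermediate = runner.get_output_dir()


def secondJob(input_file):
    # Feed both the first job's output and the original input to the second job.
    mr_job = WordsPerLine(args=[intermediate, input_file])
    with mr_job.make_runner() as runner:
        runner.run()


if __name__ == '__main__':
    firstJob(sys.argv[1])
    secondJob(sys.argv[1])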
The script can then be invoked with:
python main_script.py input.txt
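To make the data flow of the wrapper concrete, here is a minimal plain-Python sketch of the same two-pass pattern that does not depend on mrjob. It assumes, going only by the job names, that the first job counts lines and the second computes the average number of words per line. The functions `count_lines`, `words_per_line`, and `run_pipeline` and the file name `num_lines.txt` are hypothetical, not part of the original answer or of mrjob.

```python
import os
import tempfile


def count_lines(input_path, intermediate_path):
    """First pass: extract a summary (here, the line count) and save it."""
    with open(input_path) as src:
        total = sum(1 for _ in src)
    with open(intermediate_path, "w") as out:
        out.write(str(total))


def words_per_line(input_path, intermediate_path):
    """Second pass: read the saved summary, then scan the original input again."""
    with open(intermediate_path) as f:
        total_lines = int(f.read())
    with open(input_path) as src:
        total_words = sum(len(line.split()) for line in src)
    return total_words / total_lines if total_lines else 0.0


def run_pipeline(input_path):
    # Keep the intermediate file in a temporary directory that stays alive
    # for both passes and is removed only after the second pass finishes.
    with tempfile.TemporaryDirectory() as tmp:
        intermediate = os.path.join(tmp, "num_lines.txt")
        count_lines(input_path, intermediate)
        return words_per_line(input_path, intermediate)
```

The point of the sketch is the lifetime of the intermediate output: it has to outlive the first pass and remain readable until the second pass has finished, which is the same thing to watch for when chaining two runners.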