How to specifically determine input for each map s

2019-04-10 06:14发布

问题:

I am working on a map-reduce job, consisting multiple steps. Using mrjob every step receives previous step output. The problem is I don't want it to.

What I want is to extract some information and use it in second step against all input and so on. Is it possible to do this using mrjob?

Note: Since I don't want to use emr, this question is not much of help to me.

UPDATE: If it would not be possible to do this on a single job, I need to do it in two separate jobs. In this case, is there any way to wrap these two jobs and manage intermediate outpus, etc?

回答1:

You can use Runners

You will have to define the jobs separately and use another python script to invoke it.

from NumLines import NumLines
from WordsPerLine import WordsPerLine
import sys

intermediate = None

def firstJob(input_file):
    global intermediate
    mr_job = NumLines(args=[input_file])
    with mr_job.make_runner() as runner:
        runner.run()
        intermediate = runner.get_output_dir()

def secondJob(input_file):
    mr_job = WordsPerLine(args=[intermediate,input_file])
    with mr_job.make_runner() as runner:
        runner.run()

if __name__ == '__main__':
    firstJob(sys.argv[1]) 
    secondJob(sys.argv[1])

and can be invoked by:

python main_script.py input.txt