Multiple inputs with multiple mappers using MRJob

Posted 2019-08-24 08:29

Question:

Is it possible to use multiple inputs with a different mapper for each, as is available in Hadoop, using mrjob? If so, an example or a link to documentation would be helpful.

EDIT: I am trying to implement an example like the one in this question: Hadoop multiple inputs. The only difference is that I want to do it with the MRJob library, as I have to work in Python.

I have data coming in on a daily basis. For day 1, I will compute a day-level summary from a source file A with the format:

phone_number,call_minutes,datetime_of_event

leading to an output B such as:

phone_number(delimiter)month_of_year total_call_minutes
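To make the two formats concrete, here is a minimal sketch in plain Python (not MRJob) of how a day's A could be rolled up into B. The `|` key delimiter, the `YYYY-MM-DD HH:MM:SS` datetime layout, the sample records, and the name `summarize_day` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

KEY_DELIMITER = "|"  # assumed stand-in for "(delimiter)" in format B


def summarize_day(lines_a):
    """Roll up format-A lines into {phone|month: total_call_minutes}."""
    totals = defaultdict(float)
    for line in lines_a:
        phone, minutes, event_time = line.strip().split(",")
        # Assumed datetime layout; adjust to the real data.
        month = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S").month
        totals[f"{phone}{KEY_DELIMITER}{month:02d}"] += float(minutes)
    return dict(totals)


# Made-up day 1 records in format A.
day1_a = [
    "5550001,10,2019-08-01 09:15:00",
    "5550001,5,2019-08-01 17:40:00",
    "5550002,7,2019-08-01 12:00:00",
]
day1_b = summarize_day(day1_a)
```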

On day 2, I get a new A with new datetime info. Now I want to feed Day 1's B and Day 2's A to two different mappers (M1 and M2 respectively) of the same job, so that each handles its own input format while both emit the same key/value format. This will give me Day 2's B, which is a cumulative summary of days 1 and 2 together. This pattern will continue on a daily basis.
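The data flow I have in mind looks roughly like the sketch below. It is plain Python that only simulates the map, shuffle, and reduce stages in memory; it is not MRJob code and does not show how MRJob would route each input to its mapper. The names `mapper_m1`, `mapper_m2`, `reducer`, and `run_job`, the tab between key and total in B, the `|` delimiter, and the datetime layout are assumptions.

```python
from collections import defaultdict
from datetime import datetime

KEY_DELIMITER = "|"  # assumed stand-in for "(delimiter)"


def mapper_m1(line_b):
    """M1: parse one line of the previous day's B (key<TAB>total, assumed)."""
    key, total = line_b.rstrip("\n").split("\t")
    yield key, float(total)


def mapper_m2(line_a):
    """M2: parse one line of the new day's A and emit the same key format."""
    phone, minutes, event_time = line_a.strip().split(",")
    month = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S").month
    yield f"{phone}{KEY_DELIMITER}{month:02d}", float(minutes)


def reducer(key, values):
    """Shared reducer: sum the minutes for each phone/month key."""
    yield key, sum(values)


def run_job(lines_b, lines_a):
    """In-memory stand-in for the shuffle between mappers and reducer."""
    grouped = defaultdict(list)
    for line in lines_b:
        for key, value in mapper_m1(line):
            grouped[key].append(value)
    for line in lines_a:
        for key, value in mapper_m2(line):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            result[out_key] = total
    return result


# Made-up sample data: Day 1's B and Day 2's A.
day1_b = ["5550001|08\t15.0", "5550002|08\t7.0"]
day2_a = ["5550001,20,2019-08-02 08:00:00", "5550003,3,2019-08-02 10:30:00"]
day2_b = run_job(day1_b, day2_a)
```

Because both mappers emit the same key/value shape, the reducer does not need to know which input a value came from.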

I would like to know if this can be done via MRJob or any other Python-based library for Hadoop.

PS: I think I can achieve this with a single mapper, by adding a field to both the input and the output as a source-type indicator and handling each record accordingly. But I am not very keen on that method, which is why I am looking for this option; I feel it is a much cleaner approach.
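For reference, the single-mapper workaround would look something like this minimal sketch. It assumes every input line has been prefixed with a source tag (`A` or `B`) followed by a comma; the tag values, the name `tagged_mapper`, the `|` delimiter, and the datetime layout are hypothetical. For brevity it tags only the input, not the output.

```python
from datetime import datetime

KEY_DELIMITER = "|"  # assumed stand-in for "(delimiter)"


def tagged_mapper(line):
    """Dispatch on a leading source tag, then emit (phone|month, minutes)."""
    source, _, payload = line.strip().partition(",")
    if source == "B":
        # Previous summary: key<TAB>total (assumed layout).
        key, total = payload.split("\t")
        yield key, float(total)
    elif source == "A":
        # Raw events: phone_number,call_minutes,datetime_of_event.
        phone, minutes, event_time = payload.split(",")
        month = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S").month
        yield f"{phone}{KEY_DELIMITER}{month:02d}", float(minutes)
    else:
        raise ValueError(f"unknown source tag: {source!r}")


# Made-up tagged records, one from each source.
from_b = list(tagged_mapper("B,5550001|08\t15.0"))
from_a = list(tagged_mapper("A,5550001,20,2019-08-02 08:00:00"))
```

This works, but it forces the source tag into the data itself, which is the part I would like to avoid.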