Is it possible to use multiple inputs with a different mapper for each, as is available in Hadoop, using mrjob? If so, an example or a link to documentation would be helpful.
EDIT:
I am trying to implement an example like the one in this question: Hadoop multiple inputs. The only difference is that I want to do it using the MRJob library, as I have to work with Python.
I have data coming in on a daily basis. On day 1, I compute a day-level summary for a source A, which has the format:
phone_number,call_minutes,datetime_of_event
This leads to an output B such as:
phone_number(delimiter)month_of_year total_call_minutes
On day 2, I get a new A with new datetime info. Now I want to provide day 1's B and day 2's A to two different mappers (M1 and M2 respectively) of the same job to handle the different formats, with both mappers emitting the same key/value format. This will give me day 2's B, which is a cumulative summary of days 1 and 2 together. This pattern will continue on a daily basis.
I would like to know if this can be done via MRJob or any other Python-based library for Hadoop.
PS: I think I can achieve this with a single mapper by adding a field to both the input and the output as a source-type indicator and handling each record accordingly. But I am not very keen on that method, which is why I am looking for the multiple-mapper option, which I feel is a much cleaner approach.
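To make the single-mapper workaround concrete, here is a rough sketch in plain Python (no mrjob, so it runs on its own). Several details are assumptions made only for illustration and are not fixed by my actual data: the leading `A`/`B` tag field, the `|` key delimiter, a tab between key and total in B, integer call minutes, and the `YYYY-MM-DD HH:MM:SS` datetime format. The `run` helper is a hypothetical stand-in for the shuffle and reduce phases.

```python
from collections import defaultdict
from datetime import datetime

KEY_DELIM = "|"  # assumed delimiter between phone_number and month_of_year


def mapper(line):
    """Branch on a leading source tag and emit (phone|month, minutes) pairs."""
    tag, rest = line.rstrip("\n").split(",", 1)
    if tag == "A":
        # Raw daily record: phone_number,call_minutes,datetime_of_event
        phone, minutes, stamp = rest.split(",")
        # Assumed timestamp format; adjust to the real data.
        month = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").month
        yield f"{phone}{KEY_DELIM}{month}", int(minutes)
    elif tag == "B":
        # Previous summary: phone_number|month_of_year<TAB>total_call_minutes
        key, total = rest.split("\t")
        yield key, int(total)
    else:
        raise ValueError(f"unknown source tag: {tag!r}")


def reducer(key, values):
    """Sum the minutes for one phone/month key."""
    yield key, sum(values)


def run(lines):
    """Hypothetical local driver that mimics map, shuffle, and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for out_key, total in reducer(key, grouped[key]):
            result[out_key] = total
    return result
```

Feeding day 1's B lines (tagged `B`) together with day 2's A lines (tagged `A`) through `run` would give the cumulative totals per phone and month. This is exactly the tagging step I would like to avoid by having a separate mapper per input.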