I have 3 RDDs that I need to join.
val event1001RDD: schemaRDD = [eventtype,id,location,date1]
[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]
val event2009RDD: schemaRDD = [eventtype,id,celltype,date1] (not grouped by id since I need 4 dates from this depending on celltype)
[2009,4929101,R01,2015-01-20 20:44:39]
[2009,4929102,R02,2015-01-20 14:00:00] (RPM)
[2009,4929102,P01,2015-01-20 12:00:00] (PPM)
[2009,4929102,R03,2015-01-20 15:00:00] (RPM)
[2009,4929102,C01,2015-01-20 13:00:00] (RPM)
[2009,4929103,R01,2015-01-20 14:44:39]
[2009,4929105,R01,2015-01-20 12:44:39]
[2009,4929105,V01,2015-01-20 11:44:39]
[2009,4929106,R01,2015-01-20 13:44:39]
val cellLookupRDD: [celltype, cellname] (cellname has 4 values)
[R01,RPM]
[R02,RPM]
[R03,RPM]
[C01,RPM]
[P01,PPM]
[V01,PPM]
Expected result: [id,1001's location,1001's date1,2009's first RPM date,2009's last RPM date, 2009's first PPM date,2009's last PPM date]
4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL
4929102,LOC01,2015-01-20 10:44:39,2015-01-20 13:00:00,2015-01-20 15:00:00,2015-01-20 12:00:00,NULL
4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,NULL,NULL,NULL
4929104,LOC03,2015-01-20 10:44:39,NULL,NULL,NULL,NULL
4929105,NULL,NULL,2015-01-20 12:44:39,NULL,2015-01-20 11:44:39,NULL
4929106,NULL,NULL,2015-01-20 13:44:39,NULL,NULL,NULL
This is my current query (where I am also indicating an optional eventtype as the first column; but in my previous event2009RDD, I am selecting a min and max date which is wrong since I need the four dates determined via cellLookupRDD - RPM and PPM):
select if(event1001Table.eventtype is not null, event1001Table.eventtype,
event2009Table.eventtype),
if(event1001Table.id is not null, event1001Table.id,
event2009Table.id),
event1001Table.date1, event2009Table.minDate, event2009Table.maxDate
from event1001Table full outer join event2009Table
on event1001Table.id=event2009Table.id")
EDITED to show result after applying answer:
" min(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmn, " +
" max(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmx, " +
" min(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmn, " +
" max(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmx " +
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
---res: [2009,4929102,NULL,NULL,2015-01-20 13:00:00,2015-01-20 14:00:00]
--- CORRECT
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,P01,2015-01-20 14:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 14:00:00,NULL]
--- INCORRECT (max should be equal to MIN although NULL is preferred if possible but I could just check in the code later on if min=max)
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
[2009,4929102,P01,2015-01-20 09:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 09:00:00,NULL]
--- INCORRECT (max is not working)
Lets work it step by step. Lets first construct the 2009 part
Now you can do a full outer join with 1001 data set and get output.
Note: you should not have
instead, you should have
Because, if 2009 event have occurred once, then it should have both first and last date. NULL should represent an event never occurred, like for id=4929101, celltype=PPM.
Please let me know if this works (or not). I do not have access to spark right this moment, but should be able to debug, if needed, tonight.