I'm very new to Hadoop and I'm currently trying to join two sources of data where the key is an interval (say [date-begin/date-end]). For example:
input1:
20091001-20091002 A
20091011-20091104 B
20080111-20091103 C
(...)
input2:
20090902-20091003 D
20081015-20091204 E
20040011-20050101 F
(...)
I'd like to find all the records where key1 overlaps key2. Is this possible with Hadoop? Where can I find an example implementation?
Thanks.
A solution was given on Biostar: http://biostar.stackexchange.com/questions/8821
I think all that's needed is a key class where hashCode() and equals() do what you want them to do. However, I suspect you'll run into a problem, because overlap is not transitive: A can overlap B (i.e. A.equals(B) == true) and B can overlap C, while C doesn't overlap A. The equals() contract requires transitivity, and equal keys must also return the same hashCode(), so if you implement such an equals() method you'll probably get strange behaviour.
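To make the transitivity problem concrete, here is a minimal sketch using records A, B and C from the question. The `Interval` class and its field names are hypothetical, chosen for illustration only; this is plain Java, not a Hadoop key class, and it assumes the dates are inclusive and encoded as yyyyMMdd integers.

```java
// Sketch only: Interval is a hypothetical helper, not part of any Hadoop API.
class IntervalOverlap {
    static final class Interval {
        final int start; // inclusive, encoded as yyyyMMdd (assumption)
        final int end;   // inclusive, encoded as yyyyMMdd (assumption)

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Two closed intervals overlap when each one starts no later than the other ends.
        boolean overlaps(Interval other) {
            return start <= other.end && other.start <= end;
        }
    }

    public static void main(String[] args) {
        Interval a = new Interval(20091001, 20091002); // record A
        Interval b = new Interval(20091011, 20091104); // record B
        Interval c = new Interval(20080111, 20091103); // record C

        // A overlaps C and C overlaps B, yet A does not overlap B.
        System.out.println("A overlaps C: " + a.overlaps(c));
        System.out.println("C overlaps B: " + c.overlaps(b));
        System.out.println("A overlaps B: " + a.overlaps(b));
    }
}
```

If equals() were defined as overlaps(), A would equal C and C would equal B without A equalling B, which is exactly the situation that hash-based grouping cannot handle consistently.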
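Basically, you want to do something like stabbing queries on a Segment Tree: to find all intervals E overlapping an interval (p1.start, p1.end), perform stabbing queries for p1.start and p1.end. Note that these two queries only return intervals containing one of the endpoints, so intervals lying entirely inside (p1.start, p1.end) have to be picked up separately.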
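As a rough illustration of the query itself, the sketch below compares stabbing queries at the two endpoints with a direct overlap test, using record C from input1 as the query and the three records of input2. A linear scan stands in for the segment tree, and everything runs in memory rather than as a Hadoop job; the class and method names (`OverlapQuery`, `stab`, `overlapping`) are hypothetical, and dates are assumed to be inclusive yyyyMMdd integers.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a linear scan replaces the segment tree; names are hypothetical.
class OverlapQuery {
    static final class Interval {
        final int start;    // inclusive, yyyyMMdd (assumption)
        final int end;      // inclusive, yyyyMMdd (assumption)
        final String label;

        Interval(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        boolean contains(int point) {
            return start <= point && point <= end;
        }

        boolean overlaps(Interval other) {
            return start <= other.end && other.start <= end;
        }
    }

    // Stabbing query: labels of all intervals that contain the given point.
    static List<String> stab(List<Interval> intervals, int point) {
        List<String> hits = new ArrayList<>();
        for (Interval i : intervals) {
            if (i.contains(point)) hits.add(i.label);
        }
        return hits;
    }

    // Overlap query: labels of all intervals that share at least one day with the query.
    static List<String> overlapping(List<Interval> intervals, Interval query) {
        List<String> hits = new ArrayList<>();
        for (Interval i : intervals) {
            if (i.overlaps(query)) hits.add(i.label);
        }
        return hits;
    }

    // The three input2 records from the question, copied as given.
    static List<Interval> input2() {
        List<Interval> list = new ArrayList<>();
        list.add(new Interval(20090902, 20091003, "D"));
        list.add(new Interval(20081015, 20091204, "E"));
        list.add(new Interval(20040011, 20050101, "F"));
        return list;
    }

    public static void main(String[] args) {
        Interval c = new Interval(20080111, 20091103, "C"); // query from input1
        List<Interval> others = input2();
        System.out.println("stab at C.start: " + stab(others, c.start));
        System.out.println("stab at C.end:   " + stab(others, c.end));
        System.out.println("overlapping C:   " + overlapping(others, c));
    }
}
```

With this data, D lies entirely inside C, so neither endpoint query reports it, while the direct overlap test does. A real job would still need a way to bring candidate intervals together on the same reducer, which this sketch does not address.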
But no, I don't know a definitive answer to your question. Maybe a search for "Segment tree" hadoop will get you started.