I have a need for a hive query that I'm having difficulty figuring out.
I have a time series that looks like this:
time source word1 word2 ...etc
2012-02-01 23:43:16.9988243 0001 2B3B FAF0
2012-02-01 23:43:16.9993561 0002 2326 ABAA
2012-02-01 23:43:16.9998879 0002 2327 ABAA
I need a query such that, if a record in one source
fulfills specific conditions, in addition to that record, it should also return one or more records in close proximity in time from the second source
fulfilling a different set of conditions.
My attempt so far looks like this:
SELECT time
FROM messages C
JOIN messages D on
D.time
BETWEEN C.time - INTERVAL '0.001' SECOND
AND C.time + INTERVAL '0.001' SECOND
WHERE C.source = '0001'
AND D.Source = '0002'
AND C.word1 = '2B3B'
AND D.word2 = 'ABAA'
Which should return the first and second records in the sample data above (It shouldn't return the third, because the time is farther away than .001 seconds).
But the query doesn't work. The error message is
FAILED: SemanticException '0.001' encountered with 0 children
This would be a naive solution:
select *
from messages c
cross join messages m
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
;
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| time | source | word1 | word2 | time | source | word1 | word2 |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
This is the solution with the good performance
select *
from messages c
join messages m
on floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) =
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
union all
select *
from messages c
join messages m
on floor ((cast(c.time as decimal(37,7)) + 0.001) / (2 * 0.001)) =
floor ((cast(m.time as decimal(37,7)) + 0.001) / (2 * 0.001))
where floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) <>
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
and m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| time | source | word1 | word2 | _col4 | _col5 | _col6 | _col7 |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
Illustration
Events A and B are going to be caught by the upper part of the UNION ALL.
Events B and C are going to be caught by the lower part of the UNION ALL.
0 0.002 0.004 0.006 0.008 0.01
| | | | | |
-------------------------------------------------------
| |
| |
A B C
| |
| |
-------------------------------------------------------
| | | | |
0.001 0.003 0.005 0.007 0.009