Temporal Join in Hive query (events in close proxi

2019-02-27 07:51发布

问题:

I have a need for a hive query that I'm having difficulty figuring out.

I have a time series that looks like this:

time                          source    word1   word2    ...etc
2012-02-01 23:43:16.9988243    0001      2B3B    FAF0
2012-02-01 23:43:16.9993561    0002      2326    ABAA
2012-02-01 23:43:16.9998879    0002      2327    ABAA

I need a query such that, if a record in one source fulfills specific conditions, in addition to that record, it should also return one or more records in close proximity in time from the second source fulfilling a different set of conditions.

My attempt so far looks like this:

SELECT time
    FROM messages C
    JOIN messages D on
       D.time
           BETWEEN C.time - INTERVAL '0.001' SECOND
           AND     C.time + INTERVAL '0.001' SECOND

    WHERE C.source = '0001'
    AND   D.Source = '0002'
    AND   C.word1 = '2B3B'
    AND   D.word2 = 'ABAA'

Which should return the first and second records in the sample data above (It shouldn't return the third, because the time is farther away than .001 seconds).

But the query doesn't work. The error message is

FAILED: SemanticException '0.001' encountered with 0 children

回答1:

This would be a naive solution:

select  *

from                    messages c
        cross join      messages m 

where   m.time  between c.time - interval '0.001' second
                and     c.time + interval '0.001' second

    and c.word1 = '2B3B'
    and m.word2 = 'ABAA'

;

+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
|            time            | source | word1 | word2 |            time            | source | word1 | word2 |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| 2012-02-01 23:43:16.998824 |   0001 | 2B3B  | FAF0  | 2012-02-01 23:43:16.999356 |   0002 |  2326 | ABAA  |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+

This is the solution with the good performance

select  *

from                    messages c

        join            messages m

        on              floor (cast(c.time as decimal(37,7)) / (2 * 0.001))   =
                        floor (cast(m.time as decimal(37,7)) / (2 * 0.001))

where   m.time  between c.time - interval '0.001' second
                and     c.time + interval '0.001' second

    and c.word1 = '2B3B'
    and m.word2 = 'ABAA'

union all

select  *

from                    messages c

        join            messages m

        on              floor ((cast(c.time as decimal(37,7)) + 0.001) / (2 * 0.001))   =
                        floor ((cast(m.time as decimal(37,7)) + 0.001) / (2 * 0.001))

where   floor (cast(c.time as decimal(37,7)) / (2 * 0.001))     <>
        floor (cast(m.time as decimal(37,7)) / (2 * 0.001))

    and m.time  between c.time - interval '0.001' second
                and     c.time + interval '0.001' second

    and c.word1 = '2B3B'
    and m.word2 = 'ABAA'

+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
|            time            | source | word1 | word2 |           _col4            | _col5 | _col6 | _col7 |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| 2012-02-01 23:43:16.998824 |   0001 | 2B3B  | FAF0  | 2012-02-01 23:43:16.999356 |  0002 |  2326 | ABAA  |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+

Illustration

Events A and B are going to be caught by the upper part of the UNION ALL.
Events B and C are going to be caught by the lower part of the UNION ALL.

    0        0.002    0.004    0.006    0.008    0.01      
    |        |        |        |        |        |
-------------------------------------------------------
                      |        |
                      |        |
                          A  B  C
                           |        |
                           |        |
-------------------------------------------------------
         |        |        |        |        |                
         0.001    0.003    0.005    0.007    0.009


标签: sql hadoop hive