I have two tables, as follows (simplified from actual):
    mysql> desc small_table;
    +-----------------+---------------+------+-----+---------+-------+
    | Field           | Type          | Null | Key | Default | Extra |
    +-----------------+---------------+------+-----+---------+-------+
    | event_time      | datetime      | NO   |     | NULL    |       |
    | user_id         | char(15)      | NO   |     | NULL    |       |
    | other_data      | int(11)       | NO   | MUL | NULL    |       |
    +-----------------+---------------+------+-----+---------+-------+
    3 rows in set (0.00 sec)

    mysql> desc large_table;
    +-----------------+---------------+------+-----+---------+-------+
    | Field           | Type          | Null | Key | Default | Extra |
    +-----------------+---------------+------+-----+---------+-------+
    | event_time      | datetime      | NO   |     | NULL    |       |
    | user_id         | char(15)      | NO   |     | NULL    |       |
    | other_data      | int(11)       | NO   |     | NULL    |       |
    +-----------------+---------------+------+-----+---------+-------+
    3 rows in set (0.00 sec)
Now, small_table is, well, small: for each user_id there is usually only one row (though there are sometimes more). In large_table, on the other hand, each user_id appears numerous times.
    mysql> select count(1) from small_table\G
    *************************** 1. row ***************************
    count(1): 20182
    1 row in set (0.00 sec)

    mysql> select count(1) from large_table\G
    *************************** 1. row ***************************
    count(1): 2870522
    1 row in set (0.00 sec)
However, and this is important, for each row in small_table, there is at least one row in large_table with the same user_id, the same other_data, and a similar event_time (the same within a few minutes, say).
I want to know whether small_table has a row corresponding to the first, or the second, or the whateverth distinct row in large_table for the same user_id and similar event_time. That is, I'd like:
- for each user_id, a count of distinct rows of large_table in order by event_time, but only for event_time within, say, three hours; that is, I seek only the count of such rows as have event_time within, say, three hours of one another; and
- for each such collection of distinct rows, an identification of which row in that list (in order by event_time) has a corresponding row in small_table.
I don't seem able to write even a query that will do the first step, let alone a query that will do the second, and would appreciate any direction.
This should perhaps be a comment on Jonathan Leffler's detailed and helpful answer but (a) it's too long and (b) it does help answer my question, so I'm posting it as an answer.
The code titled "Multiple Event Ranges" in Jonathan Leffler's answer finds ranges where a second instance is soon after the first, and a penultimate instance is soon before the last, and no big breaks appear, but bars any big gap between interior instances, even if there are otherinstances between them. So, for example, if the limit is 3 hours, instances at 1, 2, 4, 6, and 7 would be barred because of the gap between 2 and 6. I think the correct code would instead be (building directly on Jonathan Leffler's):
This obviates the need for the last two 'and exists' clauses in the code titled "Multiple Event Ranges" in Jonathan Leffler's answer and, indeed, obviates the need for the "Singleton ranges" and "Doubleton ranges" code in his answer. Unless I'm missing something.
I'm not sure what you're asking for in the small margin you mentioned.
also:
So, try:
The necessary SQL for this is brutal; it will give your optimizer a really rather serious workout.
Judging from the comments after the question as well as the question, the desire is to treat sequences of events for a given user ID in the large table as 'contiguous' if they all fall within some fixed interval between adjacent events. For the sake of example, the fixed interval will be 3 hours. I'm coding for IBM Informix Dynamic Server (for the sake of argument, version 11.70, but 11.50 would also work fine). This means that there is an idiosyncratic notation I need to explain. Specifically, the notation 3 UNITS HOUR denotes an interval of 3 hours; it could also be written INTERVAL(3) HOUR TO HOUR in the Informix dialect of SQL, or as INTERVAL '3' HOUR in standard SQL.

There are a couple of crucial techniques in generating SQL, especially complex SQL. One is to build the SQL up in steps, piecemeal, assembling the final result. The other is to ensure that you have a clear specification of what it is you are after.
In the notation that follows, the qualification 'for the same User_ID' should be taken as always being part of the expression.
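For instance, 'there is another event for the same User_ID within 3 hours after this one' comes out roughly as follows (a small query purely to illustrate the notation, not part of the solution proper; l1 and l2 are aliases for the large table):

    -- Illustration of the notation: events that have a later event
    -- for the same user within 3 hours of them.
    SELECT l1.user_id, l1.event_time
      FROM large_table l1
     WHERE EXISTS
           (SELECT * FROM large_table l2
             WHERE l2.user_id    = l1.user_id
               AND l2.event_time > l1.event_time
               AND l2.event_time < l1.event_time + 3 UNITS HOUR)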
In the large table, there are three categories of range that we want to consider before joining with the small table.
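In outline:

- Singleton ranges: a lone event with no other event for the same user within 3 hours before or after it.
- Doubleton ranges: two events within 3 hours of each other, with nothing in between, and no other event within 3 hours before the first or after the second.
- Multiple event ranges: three or more events in which each event is within 3 hours of the next, with no other event within 3 hours before the first or after the last.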
As you can see from the description, this is going to be some scary SQL!
NB: The code has now been tested; some (mainly small) changes were necessary. One minor unnecessary change was the addition of ORDER BY clauses on intermediate queries. Another unnecessary change was to select the other data from the small table for verification purposes. This revision was made without studying the amended version posted by msh210.
Also note that I'm far from certain this is a minimal formulation; it may be feasible to classify all the ranges with a single SELECT statement instead of a UNION of three SELECT statements (and it would be good if that is the case).
Singleton ranges
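In sketch form (untested, and not necessarily the tightest formulation), a singleton is an event with no companion event for the same user within 3 hours on either side:

    -- Sketch (untested): events with nothing else for the same user
    -- within 3 hours on either side; the range starts and ends at the
    -- same time.
    SELECT l1.user_id,
           l1.event_time AS start_time,
           l1.event_time AS end_time
      FROM large_table l1
     WHERE NOT EXISTS
           (SELECT * FROM large_table l2
             WHERE l2.user_id     = l1.user_id
               AND l2.event_time != l1.event_time
               AND l2.event_time  > l1.event_time - 3 UNITS HOUR
               AND l2.event_time  < l1.event_time + 3 UNITS HOUR)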
Doubleton Ranges
Added 3 hour criterion to outer WHERE clause.
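In the same sketch style (untested): two events within 3 hours of each other, with nothing in between and nothing close by on either side:

    -- Sketch (untested): pairs of events within 3 hours of each other
    -- that form a complete range on their own.
    SELECT l1.user_id,
           l1.event_time AS start_time,
           l2.event_time AS end_time
      FROM large_table l1
      JOIN large_table l2
        ON l2.user_id = l1.user_id
     WHERE l2.event_time > l1.event_time
       AND l2.event_time < l1.event_time + 3 UNITS HOUR
       AND NOT EXISTS              -- nothing shortly before the first event
           (SELECT * FROM large_table p
             WHERE p.user_id    = l1.user_id
               AND p.event_time < l1.event_time
               AND p.event_time > l1.event_time - 3 UNITS HOUR)
       AND NOT EXISTS              -- nothing shortly after the second event
           (SELECT * FROM large_table n
             WHERE n.user_id    = l1.user_id
               AND n.event_time > l2.event_time
               AND n.event_time < l2.event_time + 3 UNITS HOUR)
       AND NOT EXISTS              -- nothing in between (that would be a longer range)
           (SELECT * FROM large_table m
             WHERE m.user_id    = l1.user_id
               AND m.event_time > l1.event_time
               AND m.event_time < l2.event_time)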
Multiple Event Ranges
Added omitted nested NOT EXISTS clause in the 'big gaps' sub-query.
All Ranges in Large Table
Clearly, the complete list of ranges in the large table is the union of the three queries above.
Query deleted as not interesting enough. It is simply the 3-way UNION of the separate queries above.
Final Query
The final query finds the ranges, if any, in the result of the gruesome 3-part UNION which is close enough to the entry in the small table. A single entry in the small table might fall at, say, 13:00, and there might be a range in the large table that ends at 11:00 and another that starts at 15:00. The two ranges from the large table are separate (the gap between them is 4 hours), but the entry in the small table is close enough to both to count. [The tests cover this case.]
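In shape, the query is roughly as follows (a sketch only; here 'ranges' is a stand-in for the 3-way UNION above, materialized as a temporary table or written in place as a parenthesized derived table):

    -- Sketch: 'ranges(user_id, start_time, end_time)' stands for the
    -- UNION of the singleton, doubleton and multiple-event queries.
    SELECT s.user_id, s.event_time, s.other_data,
           r.start_time, r.end_time
      FROM small_table s
      JOIN ranges r
        ON r.user_id = s.user_id
     WHERE s.event_time > r.start_time - 3 UNITS HOUR
       AND s.event_time < r.end_time   + 3 UNITS HOUR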
OK - fair warning; the SQL originally had not been anywhere near an SQL DBMS. The code has now been tested. The infinitesimal chance that it would work unchanged was actually zero; there was a syntax error and a couple of more or less minor problems to resolve.
I experimented in stages after devising the test data. I used the 'Alpha' data (see below) while validating and fixing the queries, and added the Beta data only to ensure that there was no cross-talk between different User_ID values.
I used explicit < and > operations rather than BETWEEN ... AND to exclude the end points; if you want events exactly 3 hours apart to count as 'close enough', then you need to review each inequality, possibly changing them to BETWEEN ... AND or possibly just using >= or <= as appropriate.

There's an answer to a loosely similar but rather simpler question that (a) I wrote and (b) provided some helpful thoughts on the complex processing above (in particular, the 'no event earlier but close enough' and 'no event later but close enough' criteria). The 'close enough' criteria most definitely complicate this question.
Test Data
Large Table
Small Table
Note: for the purposes of testing, the small table actually contains more rows than the large table. The rows in the small table with Other_Data values larger than 100 should not appear in the results (and don't). The tests here do poke at the edge conditions.
Final Query Results
Using the data above, the results obtained were:
Intermediate results
Slightly different formatting in effect.
Singleton ranges
Doubleton Ranges
Multiple Event Ranges