I have a large table containing time series sensor data. Large is anything from a few thousand up to 10M record divided amongst various channels being monitored. For a certain sensor type I need to calculate the time interval between the current and previous reading, i.e. find the largest timestamp prior to the current one.
The obvious approaches come to mind, each measured on Core i5 for a channel of 40k entries:
Correlated subquery
SELECT collect.*, prev.timestamp AS prev_timestamp
FROM data AS collect
LEFT JOIN data AS prev ON prev.channel_id = collect.channel_id AND prev.timestamp = (
SELECT MAX(timestamp)
FROM data
WHERE data.channel_id = collect.channel_id AND data.timestamp < collect.timestamp
)
WHERE collect.channel_id=14 AND collect.timestamp >= 0
ORDER BY collect.timestamp
Time (exec, fetch) 11sec, 21sec
Plan
+----+--------------------+---------+------+------------------------------+---------+---------+-------------------------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+------+------------------------------+---------+---------+-------------------------+-------+--------------------------+
| 1 | PRIMARY | collect | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | const | 45820 | Using where |
| 1 | PRIMARY | prev | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 13 | const,func | 1 | Using index |
| 2 | DEPENDENT SUBQUERY | data | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | nils.collect.channel_id | 2495 | Using where; Using index |
+----+--------------------+---------+------+------------------------------+---------+---------+-------------------------+-------+--------------------------+
Anti Join
SELECT d1.*, d2.timestamp AS prev_timestamp
FROM data d1
LEFT JOIN data d2 ON
d2.channel_id=14 AND
d2.timestamp < d1.timestamp
LEFT JOIN data d3 ON
d3.channel_id=14 AND
d3.timestamp < d1.timestamp AND
d3.timestamp > d2.timestamp
WHERE
d3.timestamp IS NULL AND
d1.channel_id=14
ORDER BY timestamp
Time 12sec, 21sec
Plan
+----+-------------+-------+------+------------------------------+---------+---------+-------+-------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------------+---------+---------+-------+-------+--------------------------------------+
| 1 | SIMPLE | d1 | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | const | 45820 | Using where |
| 1 | SIMPLE | d2 | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | const | 47194 | Using index |
| 1 | SIMPLE | d3 | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | const | 47194 | Using where; Using index; Not exists |
+----+-------------+-------+------+------------------------------+---------+---------+-------+-------+--------------------------------------+
And I've come up with another pattern I'm calling Naive Count
SELECT current.*, prev.timestamp AS prev_timestamp FROM
(
SELECT data.*, @r1 := @r1+1 AS rownum from data
CROSS JOIN (SELECT @r1 := 0) AS vars
WHERE channel_id=14
ORDER BY timestamp
) AS current
LEFT JOIN
(
SELECT data.*, @r2 := @r2+1 AS rownum from data
CROSS JOIN (SELECT @r2 := 0) AS vars
WHERE channel_id=14
ORDER BY timestamp
) AS prev
ON current.rownum = prev.rownum+1
Time 1.1sec (this one is actually fastest!)
Plan
+----+-------------+------------+--------+------------------------------+---------+---------+-----+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+------------------------------+---------+---------+-----+-------+----------------+
| 1 | PRIMARY | <derived2> | ALL | | | | | 24475 | |
| 1 | PRIMARY | <derived4> | ALL | | | | | 24475 | |
| 4 | DERIVED | <derived5> | system | | | | | 1 | |
| 4 | DERIVED | data | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | | 45820 | Using where |
| 5 | DERIVED | | | | | | | | No tables used |
| 2 | DERIVED | <derived3> | system | | | | | 1 | |
| 2 | DERIVED | data | ref | ts_uniq,IDX_ADF3F36372F5A1AA | ts_uniq | 5 | | 45820 | Using where |
| 3 | DERIVED | | | | | | | | No tables used |
+----+-------------+------------+--------+------------------------------+---------+---------+-----+-------+----------------+
As the query likely runs on small platforms like the RasPi performance is critical- a couple seconds are highest acceptable.
My question: Is the last approach a good one for greatest-n-per-group or are there better ones? Is it expected that correlated subquery is as slow as experienced?