Question:
Let's say I have a table with two columns, start and end, both integers, and the table is ordered by the first, then the second column. Each row represents an interval.
What I need is the table of merged intervals: all overlapping or adjacent intervals gobbled up into one.
It can be constructed with a JOIN query, but that is quadratic in the number of rows, which is 4 million rows in my case (I decided to compose this question because the query is still running).
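For concreteness, here is one standard shape of that quadratic join, as a sketch only: it assumes the table is named events (as in the answers below) and quotes "end" since it is a keyword. It pairs each start that no earlier interval covers with the nearest end that no interval extends past:

SELECT s.start, MIN(e."end") AS "end"
FROM events AS s
JOIN events AS e
  ON e."end" >= s.start
 AND NOT EXISTS (SELECT 1 FROM events AS e2      -- e."end" is a true right edge:
                 WHERE e2.start <= e."end"       -- no interval starting at or before it
                   AND e2."end" > e."end")       -- reaches past it
WHERE NOT EXISTS (SELECT 1 FROM events AS s2     -- s.start is a true left edge:
                  WHERE s2.start < s.start       -- no earlier interval
                    AND s2."end" >= s.start)     -- covers or touches it
GROUP BY s.start;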
It can also be done in a single pass, by running through each row and keeping track of the maximum end time - but how to do that, or something equivalent, in standard SQL? Is there any O(n) way to do it in SQL? I'm using SQLite right now; a SQLite-specific solution would also help me out this time.
From the answers to related questions (1, 2, 3, 4, 5, 6, 7, 8, 9) I can't tell whether it's possible.
Can you?
Answer 1:
Well, here is a solution that works in MySQL (I don't know whether it will work in SQLite). I think, but cannot prove, that it is O(n), discarding the time it takes to sort the events table initially (i.e., assuming it is already sorted, as I think the question states).
> SELECT * FROM events;
+-------+-----+
| start | end |
+-------+-----+
|     1 |   9 |
|     5 |   8 |
|     8 |  11 |
|    11 |  13 |
|    17 |  25 |
|    18 |  26 |
|    33 |  42 |
|    59 |  81 |
|    61 |  87 |
|    97 | 132 |
|   105 | 191 |
|   107 | 240 |
|   198 | 213 |
|   202 | 215 |
+-------+-----+
14 rows in set (0.00 sec)
SET @interval_id = 0;
SET @interval_end = 0;

SELECT
  MIN(start) AS start,
  MAX(end) AS end
FROM
  (SELECT
    -- open a new group whenever the row starts past the running end
    @interval_id := IF(start > @interval_end,
                       @interval_id + 1,
                       @interval_id) AS interval_id,
    -- keep the largest end seen within the current group
    @interval_end := IF(start < @interval_end,
                        GREATEST(@interval_end, end),
                        end) AS interval_end,
    events.*
  FROM events
  ORDER BY start, end) tmp
GROUP BY interval_id;
+-------+------+
| start | end  |
+-------+------+
|     1 |   13 |
|    17 |   26 |
|    33 |   42 |
|    59 |   87 |
|    97 |  240 |
+-------+------+
5 rows in set (0.00 sec)
Answer 2:
In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates?, where I present a RECURSIVE CTE solution to the overlapping-intervals problem. Recursive CTEs are handled differently from ordinary self-joins, and often perform amazingly fast.
MySQL does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.
Here, Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge factor.
And here, Get total time interval from multiple rows if sequence not broken is yet another one.
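As an illustration only (not the SQL Server query from the linked answer), a recursive CTE of the same shape in SQLite's dialect, assuming the events table from Answer 1 and quoting "end" because it is a keyword (SQLite has supported WITH RECURSIVE since 3.8.3):

WITH RECURSIVE merged(start, "end") AS (
    -- seed: rows whose start no earlier interval covers or touches
    SELECT start, "end"
    FROM events AS e
    WHERE NOT EXISTS (SELECT 1 FROM events AS s2
                      WHERE s2.start < e.start AND s2."end" >= e.start)
    UNION
    -- step: absorb any interval that overlaps or touches the current end
    SELECT m.start, e."end"
    FROM merged AS m
    JOIN events AS e
      ON e.start <= m."end" AND e."end" > m."end"
)
SELECT start, MAX("end") AS "end"
FROM merged
GROUP BY start;

Note that UNION (rather than UNION ALL) is what guarantees termination here: the running end strictly grows, and duplicate rows are discarded.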
Answer 3:
Based on the answer to my question in the comments, I don't think my idea would have worked. Since you mentioned it can be done with joins (and I assume you know how), I had an idea of minimizing the number of rows to be joined by keeping only ranges that belong to distinct points, like the following:
select start, max(end) as end
from (
    -- for each distinct end point, keep the earliest start (the longest range)
    select min(start) as start, end
    from events   -- the interval table ("table" itself is a reserved word)
    group by end
) in_tab
group by in_tab.start;
The inner select above makes sure that no end point repeats, keeping for each end the smallest start (the longest range ending there). The outer select does just the opposite for start points. We end up with ranges that start and end at distinct points, with any fully contained/overlapped range removed.
This might have worked if the overall range were not big. If these were dates, with at most a year between the lowest and highest date in the whole table, there would be at most 365*364 ways to pick two points, which would have been the upper limit on the rows surviving the select above. Those could then have gone into a temp table and been fed to the join method you already have. But with the numbers you mention, the theoretical count is huge, which makes this attempt irrelevant: even though the above minimizes the rows used in the calculation, they would still be too many to join.
I do not know of a way to do this in ANSI SQL without joins when the RDBMS provides no further non-standard functionality. In Oracle, for example, this can easily be achieved with analytic functions. In that case the best option would be to use the query above to minimize the number of rows, bring them into your application, compute the merged ranges in code there, and insert them back into the database.
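By way of illustration, a sketch of that analytic-function idea using standard window functions (it also runs on SQLite 3.25+ and Postgres; the events table and columns follow Answer 1, with "end" quoted because it is a keyword):

WITH flagged AS (
    SELECT start, "end",
           -- 1 when this row starts past everything seen so far, else 0;
           -- the window MAX is NULL on the first row, so the CASE yields 1
           CASE WHEN start <= MAX("end") OVER (
                    ORDER BY start, "end"
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0 ELSE 1 END AS is_new
    FROM events
),
grouped AS (
    -- running count of group openings gives each row its group id
    SELECT start, "end",
           SUM(is_new) OVER (ORDER BY start, "end") AS grp
    FROM flagged
)
SELECT MIN(start) AS start, MAX("end") AS "end"
FROM grouped
GROUP BY grp;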
Answer 4:
For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n). With a covering index, the queries turned out to be fast enough for my needs; with just an index on either the start or end column, it was slower but still OK. In each case, EXPLAIN QUERY PLAN told me that a single table scan is combined with use of the index, as expected. Finding an element in the index isn't quite O(1), but turned out to be close enough. And building the index isn't slow, either.
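For reference, a minimal sketch of such a covering index (the index name is illustrative; "end" is quoted because it is a keyword):

-- with both columns in the index, SQLite can answer the merge query
-- from the index alone
CREATE INDEX idx_events_start_end ON events (start, "end");

With this in place, EXPLAIN QUERY PLAN should report something like SEARCH ... USING COVERING INDEX.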
What remains is the proof that a true O(n) algorithm can't be written in SQL.
So another answer is to write it in a different language and then apply it to a SQLite table.
There are various ways to make that work:
- export the table to a CSV file; read the CSV file, apply the algorithm, produce CSV; import the resulting CSV file as a table (see the sketch after this list);
- use a SQLite driver for that language (e.g. DBD::SQLite for Perl, RSQLite for R);
- write a SQLite extension function that somehow interfaces with the language of choice.
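For the CSV route, a sketch of the export using the sqlite3 shell's dot-commands (file and table names are illustrative):

.mode csv
.output intervals.csv
SELECT start, "end" FROM events ORDER BY start, "end";
.output stdout

Once the single-pass merge over intervals.csv has produced merged.csv in whatever language, .import merged.csv merged_events (still in .mode csv) loads the result back as a table.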