Efficiently joining over interval ranges in SQL

Published: 2019-02-24 19:12

Question:

Suppose I have two tables as follows (data taken from this SO post):

Table d1:

 x start end
 a     1   3
 b     5  11
 c    19  22
 d    30  39
 e     7  25

Table d2:

 x pos
 a   2
 a   3
 b   3
 b  12
 c  20
 d  52
 e  10

The first row in each table is the column header. I'd like to extract all rows in d2 where column x matches d1 and pos falls within (boundary values included) d1's start and end columns. That is, I'd like this result:

 x pos start  end
 a   2     1    3
 a   3     1    3
 c  20    19   22
 e  10     7   25

The way I've seen this done so far is:

SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end

But what is not clear to me is whether this operation is performed as efficiently as it can be (i.e., optimised internally). For example, computing the entire join first and only then filtering is not really a scalable approach IMHO (in terms of both speed and memory).
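For context, this is roughly how I imagine indexing the plain BETWEEN join (index names are my own invention, and I don't know how much the planner actually benefits from them):

-- Possible supporting indexes for the plain join (hypothetical names);
-- whether the planner uses them depends on the engine.
-- "end" is quoted PostgreSQL/SQLite-style because it is a reserved word.
CREATE INDEX idx_d1_x_start_end ON d1 (x, start, "end");
CREATE INDEX idx_d2_x_pos       ON d2 (x, pos);

SELECT d2.x, d2.pos, d1.start, d1."end"
FROM d2
JOIN d1 USING (x)
WHERE d2.pos BETWEEN d1.start AND d1."end";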

Are there any other efficient query optimisations (e.g., using interval trees) or other algorithms that handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter whether it's SQLite, PostgreSQL, MySQL, etc.
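In PostgreSQL, for instance, I gather one could model the interval as a range type and index it with GiST (rough sketch below; the table d1r and the index name are hypothetical, and I haven't benchmarked any of this):

-- PostgreSQL-only sketch: store each interval as an int4range and
-- index it with GiST so containment (@>) can use the index.
CREATE TABLE d1r (
    x    text,
    span int4range   -- e.g. int4range(1, 3, '[]') for row 'a'
);
CREATE INDEX idx_d1r_span ON d1r USING gist (span);

SELECT d2.x, d2.pos, lower(d1r.span) AS start, upper(d1r.span) - 1 AS "end"
FROM d2
JOIN d1r USING (x)
WHERE d1r.span @> d2.pos;

With the btree_gist extension it might even be possible to put (x, span) into a single GiST index, but I don't know whether that pays off here.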

What is the most efficient way to perform this operation in SQL?

Thank you very much.

Answer 1:

Not sure how it all works out internally, but depending on the situation I would advise playing around with a table that 'rolls out' all the values from d1 and then joining on that one. This way the query engine can pinpoint the right record 'exactly' instead of having to find a combination of boundaries that matches the value being looked for.

e.g.

x value
a  1
a  2
a  3
b  5
b  6
b  7
b  8
b  9
b 10
b 11
c 19 etc..

Given an index on the value column (**), this should be quite a bit faster than joining with BETWEEN start AND end on the original d1 table, IMHO.
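For what it's worth, in PostgreSQL the rolled-out table could be generated straight from d1 with generate_series (integer boundaries assumed; other engines would need a recursive CTE or a numbers table). Something along these lines:

-- Build the rolled-out table from d1 (assumes integer boundaries).
CREATE TABLE d1_rolled AS
SELECT d1.x, g.value
FROM d1
CROSS JOIN LATERAL generate_series(d1.start, d1."end") AS g(value);

CREATE INDEX idx_d1_rolled ON d1_rolled (value, x);

-- The lookup then becomes a plain equality join:
SELECT d2.x, d2.pos
FROM d2
JOIN d1_rolled r ON r.x = d2.x AND r.value = d2.pos;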

Of course, each time you make changes to d1, you'll need to adjust the rolled-out table too (trigger?). If this happens frequently, you'll spend more time updating the rolled-out table than you gained in the first place! Additionally, this might take up quite a bit of (disk) space quickly if some of the intervals are really big; it also assumes we never need to look for non-whole numbers (e.g., what if we look for the value 3.14?).
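If you did go the trigger route in PostgreSQL, an insert-only version could look roughly like this (hypothetical names; updates and deletes would need similar handlers, and older versions use EXECUTE PROCEDURE instead of EXECUTE FUNCTION):

-- Keep d1_rolled in sync when a row is added to d1 (insert case only).
CREATE FUNCTION d1_roll_out() RETURNS trigger AS $$
BEGIN
    INSERT INTO d1_rolled (x, value)
    SELECT NEW.x, g.v FROM generate_series(NEW.start, NEW."end") AS g(v);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_d1_roll_out
AFTER INSERT ON d1
FOR EACH ROW EXECUTE FUNCTION d1_roll_out();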

(**) You might consider experimenting with a unique index on (value, x) here...