I have a huge database table with n integer intervals (for instance {1-5}, {4-16}, {6434-114343}) and need to find out which intervals overlap each other. There's a wealth of similar questions on SO, but the difference is I need returned, for each interval respectively, the set of overlapping intervals.
------------------ A -------------------
------ B ------- ----- D -----
--------- C ---------
For this example, output would be A:{B,C,D} B:{A,C} C:{A,B} D:{A}
Worst case, all intervals could overlap each other, producing an output of size O(n2). This is no better than the naïve solution (compare every pair of intervals). In practice however, I know few of my intervals will overlap other intervals, and when they do, only up to 5 other intervals.
Given this info, how should I solve the problem? (Optimally, I would like a SQL query solution since the data is in the database, but I think only a regular algorithmic solution is possible.)
Build a sorted sequence of interval starts and ends, then traverse it, every time updating list of current intervals, report any new found intersection.
Something like this:
std::vector<TBorder> borders;
for(auto i=intervals.begin();i!=intervals.end();++i)
{
borders.push_back(TBorder(i.Start(),Start));
borders.push_back(TBorder(i.End(),End));
}
std::sort(borders.begin(),borders.end());
std::set<int> currentIntervals;
for(auto b=borders.begin();b!=borders.end();++b)
{
if(b.IsEnd())
currentIntervals.erase(b.IntervalIndex());
else
{
currentIntervals.insert(b.IntervalIndex());
if(currentIntervals.size()>1)
ReportIntersection(currentIntervals);
}
}
Generally it's O(n*log n) (assuming number of intersections is O(1) ).
But if you already have intervals sorted by e.g. start, likely sorting can be done in O(n) (again assuming number of intersection is O(1)).
The typical programmatic solution for your problem is building an interval tree out of all ranges and then performing one lookup per interval, which gives you a list of all intersecting intervals in O(log n)
time. Here's a sample what such an interval tree looks like:
In you case, though, you would store the primary keys in the tree nodes as well, so given the following dates (finding overlapping dates is a common problem that can be solved with interval trees)
your tree would look like this
So if I want to know which intervals overlap with C, I search for the start of C, 1843, and the tree tells me, that this value is only within the interval C, which is the interval I'm testing for, so I can ignore it. Then I search for the end of C, 1907, and the tree tells me, that it is in the intervals A, B, and C, again I can ignore C, and thus my result set is A and B.
I admit, the lookup in such a tree is not that intuitive as one might expect. I'll try to explain it here as good as I can: You start at the top root node and at each node decide to either go left or go right until you hit a leave node (a node that has no children any more). If the node value is bigger than the value you are searching for, you go left. If the node value is smaller than the value you are searching for, you go to the right. What if you the node value equals exactly the value you are searching for? It depends! If you are searching for the start of an interval, equal value means you go right, if you search for the end of an interval, equal value means you go left. This is very important. Once you reached a leave node, you are done and all intervals you found in any node on your way to that leave node, including the interval stored in the leave node itself (if any) make up your result set, not only the interval stored in the leave node. That means you must collect any intervals you come across while performing your search.
Now back to your initial question: Can all this be done in SQL? Yes, it can be done. I'm not sure if you really want to do it, though. You can transform your current SQL table data into a SQL table that represents an interval tree and then directly perform the lookups in that interval tree table. At least I found someone who did exactly that. He tries to find all date ranges the cover a given date without having to compare the date against all existing ranges in the database:
A Static Relational Interval Tree
He even uses a nifty trick to optimize lookups for speed, cutting down CPU usage dramatically for both, building the lookup table and performing the actual lookups (which makes the whole thing quite complicated).