I am designing some system that will store records containing a begin and end time. For example:
CREATE TABLE test (
id bigserial PRIMARY KEY,
ts_start timestamp NOT NULL,
ts_end timestamp NOT NULL,
foo bar NOT NULL,
...
);
Now I want to run queries on this to find all rows that overlap with a certain timestamp. This would result in a where clause like:
WHERE ts_start <= '2006-4-6 12:34:56' AND ts_end > '2006-4-6 12:34:56'
I did test this with a massive amount of generated test data and the performance is quite bad. I tested it with an index on ts_start and another index on ts_end and also with an multi column index on ts_start and ts_end. The last gave the best result but it is still far from optimal.
The problem is that postgresql doesn't know the fact that ts_end is guaranteed to be larger then ts_start so it uses a plan that is capable of finding rows where ts_end is smaller then ts_start.
Any suggestions how to solve this problem?
Edit: For people having this problem too if you can wait a little longer then PostgreSQL 9.2 has the perfect solution: range types. 9.2 is in beta now final release will most likely be at the end of 2012.
There was "temporal postgres" (google it) but I don't know if it is still maintained... I believe there was a discussion of including this type of search into postgres but I don't remember the final state of it. Anyway :
Example using box and gist :
As you can see gist index is ridiculously fast here (1500 times ! lol) (and you can use many operators like overlaps, is contained, contains, etc.
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
Your encountering the same problem as someone trying to index line segments and then query if a point is in the segment. You just can't do that by indexing each dimension separately and you need something that indexes by constructing some sort of BSP structure.
I'm not sure if PG has any built in data type to support date ranges, but I'm certain that if you use PostGIS to represent the time ranges as points in 2D space and then tell PG to geo-index that, you'll get the best possible performance out of this query.
Maybe there's a date-specific equivalent of my suggestion built into pg, but, as I said, I'm not familiar with it. I am familiar with the geometrical indexing capabilities of pg, though, and I think you should consider it seriously as an optimization.
Here's a naive example (although I'm sure it will be very fast to query):
illustration:
In situations like this you need to re-express your query so as to tell it to Postgres.
This much like you'd do when querying against lft/rgt in a nested set: if you've a tree indexed using lft/rgt in such a way that children have
parent_lft < lft
andlft < rgt
andparent_lft < parent_rgt
, the optimal query will rely onparent_lft < lft
andlft < parent_rgt
(which looks up the index onlft
for a small range) rather thanparent_lft < lft
andrgt < parent_rgt
(which looks up the index onlft
from one point onward).You're in a similar situation when you're adding an index. Unless you constrain either or both of ts_start and ts_end, you're going to look at a large set of rows.
For that particular query, you might want to look into geometry types and using a GIST index.
Specifically, if you floor ts_start and ceil ts_end to midnight, you can get an integer representation (number of days since epoch, for instance). Then store the latter as an indexable type and query it using an overlap condition.
As a side note, there was some discussion on adding some kind of timestamp segment/event type in recent months on the pg-hackers list, but I'm miserably failing to find the relevant references through googling. So... mentioning it here, in case you're luckier than I was.