Suppose I am storing events
associated with users
in a table as follows (with dt
standing in for the timestamp of the event):
| dt | user | event |
| 1 | 1 | A |
| 2 | 1 | D |
| 3 | 1 | B |
| 4 | 1 | C |
| 5 | 1 | B |
| 6 | 2 | B |
| 7 | 2 | B |
| 8 | 2 | A |
| 9 | 2 | A |
| 10 | 2 | C |
Such that we could say:
- user 1 has an event-sequence of ADBCB
- user 2 has event-sequence BBAAC
The types of questions I would want to answer about these users are very easy to express as regular expresions on the event-sequences, e.g. "which users have an event-sequence matching A.*B?" or "which users have an event-sequence matching A[^C]*B[^C]*D?" etc.
What would be a good SQL technique or operator I could use to answer similar queries over this table structure?
Is there a way to efficiently/dynamically generate a table of user
-to-event-sequence
which could then be queried with regex?
I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMS's like SQLServer or Oracle have specialized operators for this as well.
I'm not at a computer to write code for this answer, but here's how I would go about a RegEx-based solution in SQL Server:
This should ultimately provide you with the functionality in SQL Server that your original question requests, however, if you're analyzing a very large dataset, this could be quite slow and there may be better ways to accomplish what you're looking for.
With Postgres 9.x this is actually quite easy:
Using that result you can now apply a regular expression on the event_sequence:
With Postgres 8.x you need to find a replacement for the string_agg() function (just google for it, there are a lot of examples out there) and you need a sub-select to ensure the ordering of the aggregate as 8.x does support an
order by
in an aggregate function.For Oracle (version 11g R2):
By chance if you are using Oracle DB 11g R2, take look at listagg. The below code should work, but I haven't tested. The point is: you can use
listagg
.In prior versions you can do with CONNECT BY clause. More details on listagg.