Querying sequences of rows in SQL

2020-06-13 13:17发布

Suppose I am storing events associated with users in a table as follows (with dt standing in for the timestamp of the event):

| dt | user | event |
|  1 |  1   |   A   |
|  2 |  1   |   D   |
|  3 |  1   |   B   |
|  4 |  1   |   C   |
|  5 |  1   |   B   |
|  6 |  2   |   B   |
|  7 |  2   |   B   |
|  8 |  2   |   A   |
|  9 |  2   |   A   |
| 10 |  2   |   C   |

Such that we could say:

  • user 1 has an event-sequence of ADBCB
  • user 2 has event-sequence BBAAC

The types of questions I would want to answer about these users are very easy to express as regular expresions on the event-sequences, e.g. "which users have an event-sequence matching A.*B?" or "which users have an event-sequence matching A[^C]*B[^C]*D?" etc.

What would be a good SQL technique or operator I could use to answer similar queries over this table structure?

Is there a way to efficiently/dynamically generate a table of user-to-event-sequence which could then be queried with regex?

I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMS's like SQLServer or Oracle have specialized operators for this as well.

标签: sql regex
3条回答
看我几分像从前
2楼-- · 2020-06-13 13:27

I'm not at a computer to write code for this answer, but here's how I would go about a RegEx-based solution in SQL Server:

  1. Build a string from the resultset. Something like http://blog.sqlauthority.com/2009/11/25/sql-server-comma-separated-values-csv-from-table-column/ should work if you omit the comma
  2. Run your RegEx match against the resulting string. Unfortunately, SQL Server does not provide this functionality natively, however, you can use a CLR function for this purpose as described at http://www.ideaexcursion.com/2009/08/18/sql-server-regular-expression-clr-udf/

This should ultimately provide you with the functionality in SQL Server that your original question requests, however, if you're analyzing a very large dataset, this could be quite slow and there may be better ways to accomplish what you're looking for.

查看更多
唯我独甜
3楼-- · 2020-06-13 13:30

With Postgres 9.x this is actually quite easy:

select userid, 
       string_agg(event, '' order by dt) as event_sequence
from events
group by userid;

Using that result you can now apply a regular expression on the event_sequence:

select * 
from (
  select userid, 
         string_agg(event, '' order by dt) as event_sequence
  from events
  group by userid
) t
where event_sequence ~ 'A.*B'

With Postgres 8.x you need to find a replacement for the string_agg() function (just google for it, there are a lot of examples out there) and you need a sub-select to ensure the ordering of the aggregate as 8.x does support an order by in an aggregate function.

查看更多
Anthone
4楼-- · 2020-06-13 13:41

For Oracle (version 11g R2):

By chance if you are using Oracle DB 11g R2, take look at listagg. The below code should work, but I haven't tested. The point is: you can use listagg.

SQL> select user,
  2         listagg( event, '' ) 
  3         within group (order by dt) events
  4     from users
  5    group by user
  6    order by dt
  7   /

     USER   EVENTS
---------  --------------------
1          ADBCB
2          BBAAC

In prior versions you can do with CONNECT BY clause. More details on listagg.

查看更多
登录 后发表回答