Querying sequences of rows in SQL

Suppose I am storing events associated with users in a table as follows (with dt standing in for the timestamp of the event):

| dt | user | event |
|  1 |  1   |   A   |
|  2 |  1   |   D   |
|  3 |  1   |   B   |
|  4 |  1   |   C   |
|  5 |  1   |   B   |
|  6 |  2   |   B   |
|  7 |  2   |   B   |
|  8 |  2   |   A   |
|  9 |  2   |   A   |
| 10 |  2   |   C   |

Such that we could say:

user 1 has an event-sequence of ADBCB
user 2 has event-sequence BBAAC

The types of questions I would want to answer about these users are very easy to express as regular expresions on the event-sequences, e.g. "which users have an event-sequence matching A.*B?" or "which users have an event-sequence matching A[^C]*B[^C]*D?" etc.

What would be a good SQL technique or operator I could use to answer similar queries over this table structure?

Is there a way to efficiently/dynamically generate a table of user-to-event-sequence which could then be queried with regex?

I am currently looking at using Postgres, but I am curious to know if any of the bigger DBMS's like SQLServer or Oracle have specialized operators for this as well.

标签： sql regex

3条回答

看我几分像从前

2楼-- · 2020-06-13 13:27

I'm not at a computer to write code for this answer, but here's how I would go about a RegEx-based solution in SQL Server:

Build a string from the resultset. Something like http://blog.sqlauthority.com/2009/11/25/sql-server-comma-separated-values-csv-from-table-column/ should work if you omit the comma
Run your RegEx match against the resulting string. Unfortunately, SQL Server does not provide this functionality natively, however, you can use a CLR function for this purpose as described at http://www.ideaexcursion.com/2009/08/18/sql-server-regular-expression-clr-udf/

This should ultimately provide you with the functionality in SQL Server that your original question requests, however, if you're analyzing a very large dataset, this could be quite slow and there may be better ways to accomplish what you're looking for.

0人赞添加讨论(0) 举报

唯我独甜

3楼-- · 2020-06-13 13:30

With Postgres 9.x this is actually quite easy:

select userid, 
       string_agg(event, '' order by dt) as event_sequence
from events
group by userid;

Using that result you can now apply a regular expression on the event_sequence:

select * 
from (
  select userid, 
         string_agg(event, '' order by dt) as event_sequence
  from events
  group by userid
) t
where event_sequence ~ 'A.*B'

With Postgres 8.x you need to find a replacement for the string_agg() function (just google for it, there are a lot of examples out there) and you need a sub-select to ensure the ordering of the aggregate as 8.x does support an order by in an aggregate function.

0人赞添加讨论(0) 举报

Anthone

4楼-- · 2020-06-13 13:41

For Oracle (version 11g R2):

By chance if you are using Oracle DB 11g R2, take look at listagg. The below code should work, but I haven't tested. The point is: you can use listagg.

SQL> select user,
  2         listagg( event, '' ) 
  3         within group (order by dt) events
  4     from users
  5    group by user
  6    order by dt
  7   /

     USER   EVENTS
---------  --------------------
1          ADBCB
2          BBAAC

In prior versions you can do with CONNECT BY clause. More details on listagg.

0人赞添加讨论(0) 举报

Querying sequences of rows in SQL

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间