TSQL Time Series Pattern Data Mining

2019-04-12 06:11发布

问题:

Take a SQL table with the following 3 fields:

Id,TimeStamp,Item,UserId

I would like to determine the most common sequences of Item for a UserId in a session. A session would simply be defined by a threshold of time (i.e. if there are no entires for X minutes, any future entries would be grouped into a new session).

Ideally, the sequence of Items could have a sort of fuzzy grouping where one or two differences in the sequence could still be counted as the same and grouped together.

Anyone know how I might tackle this problem in SQL?

Update:
To clarify, lets pretend that the Items are grocery store isles. And I have a month of people visiting the grocery store. The basic question is what isles are people using and it what order. Do they go isles 1,2,3 or 1,2,1,3,4 most frequently?

(Right now I am curious about paths of users on our sites, but you know, grocery store is more visual).

Update 2:
Here is a simple case:

CREATE Table #StoreActivity
(
    id int,
    CreationDate datetime ,
    Isle int,
    UserId int
)

Insert INTO #StoreActivity
Values
    (1, CAST('12-1-2011 03:10:01' AS Datetime), 1, 2222),
    (2, CAST('12-1-2011 03:10:07' AS Datetime), 1, 1111),
    (3, CAST('12-1-2011 03:10:12' AS Datetime), 2, 2222),
    (4, CAST('12-1-2011 04:10:01' AS Datetime), 1, 2222),
    (5, CAST('12-1-2011 04:10:23' AS Datetime), 2, 2222)

Select * from #StoreActivity
DROP Table #StoreActivity

/* So with the above data, we have 2 sequences if we declare a session or visit dead if there is no activity for a minute : `1,2` (With a count of 2), and `1` (with a count of 1)*/

回答1:

WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY TimeStamp, Id) AS rn,
                ROW_NUMBER() OVER (PARTITION BY UserId, Item ORDER BY TimeStamp, Id) AS rnd
        FROM    mytable
        )
SELECT  *,
        rnd - rn AS sequence
FROM    q

The sequence column will be shared among all records in a sequence for a given UserId. You can group on it or do whatever you like.



回答2:

This sounds like the old fashion Association Rule mining problem (but ordered which makes it easier), and if I have understood the question right, there is not one [EFFICIENT] SQL answer to this problem.

  • If your problem involves maximum frequent route mining (e.g. a route 1,3,2 can be used in both routes 4,1,3,2 and 1,3,2,4), it would not be so easy to solve. I would suggest look at AR mining in SSAS data mining.

  • If your problem does not involve maximum frequent route mining (e.g. 1,3,2 will be counted only if it the exact same route it used) then it is just a matter of group by and count.

Before you will be able to use any of the above solutions, you need to figure out your sessions which can be done using the following script:

WITH ts as
(
    SELECT 
        id,
        UserId, 
        CreationDate,
        Isle,
        ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY CreationDate) AS rowNum
    FROM #StoreActivity a

)
SELECT t2.*, KeepSession = CASE WHEN DATEDIFF(second,t1.CreationDate, t2.CreationDate) > 60 THEN 0 ELSE 1 END   
INTO #tempStore
FROM ts as t1
LEFT JOIN ts as t2 ON t1.rowNum = t2.rowNum - 1 
WHERE t1.UserId = t2.UserId

SELECT *, id, UserId, 
SessionId = (SELECT count(*) FROM #tempStore where KeepSession = 0 and UserId = t.UserId and rowNum <= t.rowNum)
FROM #tempStore t
--where t. t.KeepSession = 0

DROP TABLE #tempStore

Result will be:

id          UserId      CreationDate            Isle        rowNum               KeepSession id          UserId      SessionId
----------- ----------- ----------------------- ----------- -------------------- ----------- ----------- ----------- -----------
3           2222        2011-12-01 03:10:12.000 2           2                    1           3           2222        0
4           2222        2011-12-01 04:10:01.000 1           3                    0           4           2222        1
5           2222        2011-12-01 04:10:23.000 2           4                    1           5           2222        1
6           2222        2011-12-01 04:10:24.000 3           5                    1           6           2222        1
7           2222        2011-12-01 04:12:23.000 3           6                    0           7           2222        2
8           2222        2011-12-01 04:12:24.000 3           7                    1           8           2222        2

(6 row(s) affected)

NOTE: Above solution can be optimized for performance. #tempStore can be removed.