I have a sequence object created like this:
library(TraMineR)
subsequences <- function(data){
  slmax <- max(data$time)
  sequences.seqe <- seqecreate(data)
  # Convert the spell-format data to distinct successive states (DSS)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}
data <- subsequences(data)
head(data)
Which gives the output:
Sequence
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged
[3] *-discussed-*-discussed-*-discussed-*-discussed
[4] *-opened-*-discussed-merged-discussed
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed
But when I calculate the subsequences, I get seemingly ridiculous answers:
seqsubsn(head(data))
[!] found missing state in the sequence(s), adding missing state to the alphabet
Subseq.
[1] 1036
[2] 1248
[3] 88
[4] 49
[5] 294
[6] 240
How could the number of subsequences be far larger than the number of events in each sequence?
A 'dput()' of the dataset can be found here. The issue seems to be that the original data has timestamps in seconds. However, I've used the function below to replace the timestamps with sequential integers:
library(sqldf)
read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  # Keep only the events between startdate and stopdate (inclusive)
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  # Replace each raw timestamp with its rank among the distinct sorted values
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}
This makes it possible to create appropriate measures for entropy, sequence length etc., but the number of subsequences is still problematic.
The numbers of subsequences returned are not surprising at all. It is a matter of how 'subsequence' is defined, which should not be confused with 'substring'.
A sequence $x = (x_1, x_2, \ldots, x_n)$ is a subsequence of $y$ if its elements $x_i$ are all in $y$ and occur in the same order as in $y$; they do not have to be adjacent in $y$. For instance, A-B-A is a subsequence of C-A-D-B-C-D-A-D.
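The definition above can be checked mechanically. Below is a minimal base R sketch; is_subsequence is a hypothetical helper written for this illustration, not a TraMineR function, and it assumes sequences are given as character vectors of states.

```r
# Hypothetical helper (not part of TraMineR): returns TRUE if every element
# of x appears in y in the same order, not necessarily contiguously.
is_subsequence <- function(x, y) {
  pos <- 0L  # index in y of the last matched element
  for (el in x) {
    # positions of el in y that lie strictly after the last match
    hits <- which(y == el)
    hits <- hits[hits > pos]
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]  # greedy: take the earliest possible match
  }
  TRUE
}

y <- c("C", "A", "D", "B", "C", "D", "A", "D")
is_subsequence(c("A", "B", "A"), y)  # TRUE: A, B, A occur in that order
is_subsequence(c("B", "A", "C"), y)  # FALSE: there is no C after the last A
```

Note that A-B-A is a subsequence of y but not a substring of it, because the three states are not adjacent in y.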
To illustrate, consider the 'mvad' example from the TraMineR package.
By default, seqsubsn computes the number of subsequences of the distinct successive states (DSS). The DSS of the first sequence, for example, is EM-TR-EM. Its seven subsequences are the empty sequence, EM, TR, EM-TR, EM-EM, TR-EM and EM-TR-EM.
Proceeding the same way, you can verify that your fourth sequence (which is equal to its DSS) has 49 subsequences, of which ten have length two:
*-opened, *-*, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed
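These counts can be reproduced without TraMineR. The sketch below is a plain base R illustration and not the package's own implementation; count_subsequences is a hypothetical name. It assumes the empty subsequence is included in the count, which is consistent with the values 7 and 49 above. It uses the usual recurrence: each new state doubles the count, minus the subsequences already counted when that state last occurred.

```r
# Hypothetical helper: number of distinct subsequences of the character
# vector s, counting the empty subsequence.
count_subsequences <- function(s) {
  n <- length(s)
  dp <- numeric(n + 1L)  # dp[i + 1] = count for the first i elements of s
  dp[1L] <- 1            # only the empty subsequence
  for (i in seq_len(n)) {
    dp[i + 1L] <- 2 * dp[i]
    # earlier positions holding the same state as s[i]
    prev <- which(s[seq_len(i - 1L)] == s[i])
    if (length(prev) > 0L) {
      # remove subsequences already counted at the last occurrence
      dp[i + 1L] <- dp[i + 1L] - dp[max(prev)]
    }
  }
  dp[n + 1L]
}

count_subsequences(c("EM", "TR", "EM"))  # 7

s4 <- c("*", "opened", "*", "discussed", "merged", "discussed")
count_subsequences(s4)                   # 49

# Distinct two-length subsequences of the fourth sequence
idx <- combn(length(s4), 2)              # all index pairs i < j
two <- unique(paste(s4[idx[1, ]], s4[idx[2, ]], sep = "-"))
length(two)                              # 10
```

Enumerating the pairs this way gives ten distinct two-length subsequences, because *-* (the first and third states) also qualifies.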
Hope this helps