I have a sequence object created like this:
subsequences <- function(data){
slmax <- max(data$time)
sequences.seqe <- seqecreate(data)
sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
(sequences.sts)
}
data <- subsequences(data)
head(data)
Which gives the output:
Sequence
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged
[3] *-discussed-*-discussed-*-discussed-*-discussed
[4] *-opened-*-discussed-merged-discussed
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed
But when I calculate the subsequences, I get seemingly ridiculous answers:
seqsubsn(head(data))
[!] found missing state in the sequence(s), adding missing state to the alphabet
Subseq.
[1] 1036
[2] 1248
[3] 88
[4] 49
[5] 294
[6] 240
How could the number of subsequences be far longer than the number of events in each sequence?
A 'dput()' of the dataset can be found here. The issue seems to be that the original data has time stamps in seconds. However, I've used the function below in order to change the timestamps to simply be sequential:
read_seqdata <- function(data, startdate, stopdate){
data <- read.table(data, sep = ",", header = TRUE)
data <- subset(data, select = c("pull_req_id", "action", "created_at"))
colnames(data) <- c("id", "event", "time")
data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
'unixepoch', 'localtime') <= '",stopdate,"'"))
data$end <- data$time
data <- data[with(data, order(time)), ]
data$time <- match( data$time , unique( data$time ) )
data$end <- match( data$end , unique( data$end ) )
slmax <- max(data$time)
(data)
}
This makes it possible to create appropriate measures for entropy, sequence length etc., but the number of subsequences is still problematic.