Is it possible to load a list of frequent subsequences from a .txt file, and make TraMineR recognize it as a sequence object?
Unfortunately I don't have the raw data, so I cannot recreate the analysis. The only file I have is a .txt file containing the frequent subsequences. I assume it was created with the seqefsub() function from the TraMineR package, with maxGap=2, because the data looks like the output of that function.
read.table() reads it as a data frame, but as far as I understand, TraMineR handles event sequences as lists with many additional attributes that are not contained in this file. Or I don't know how to extract them...
This is what a couple of lines from the .txt file look like:
Subsequence Support Count
16 (WT4)-(WT3) 0.76666667 805
17 (WL2) 0.76380952 802
18 (S1) 0.76000000 798
19 (FRF,WL2) 0.74380952 781
20 (WT2)-(WT1) 0.70571429 741
To create an event sequence object from the (text) subsequences, you have to transform them into vertical time-stamped event (TSE) form. The function below does the job for your data:
## Function subseq.to.TSE
## puts the sequences into TSE format using
## the position of each transition as timestamp
## sdf: a data frame with columns Id, Subsequence, Support and Count.
subseq.to.TSE <- function(sdf){
  ## stringsAsFactors=TRUE keeps event a factor in R >= 4.0 as well
  tse <- data.frame(id=0, event="", time=0, stringsAsFactors=TRUE)
  k <- 0
  for (i in seq_len(nrow(sdf))){
    id <- sdf[i,"Id"]
    s <- as.character(sdf[i,"Subsequence"])
    ## remove the parentheses
    ss <- gsub("\\(","",s)
    ss <- gsub("\\)","",ss)
    ## split transitions
    st <- strsplit(ss, split="-")[[1]]
    for (j in seq_along(st)){
      ## parsing for simultaneous events
      stt <- strsplit(st[j], split=",")[[1]]
      for (jj in seq_along(stt)){
        k <- k+1
        tse[k,1] <- id
        ## add the event to the factor levels if it is new
        if (!(stt[jj] %in% levels(tse[,2])))
          {levels(tse[,2]) <- c(levels(tse[,2]),stt[jj])}
        tse[k,2] <- stt[jj]
        tse[k,3] <- j
      }
    }
  }
  return(tse)
}
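If the list of subsequences is long, growing the data frame row by row can become slow. As a rough sketch only, the same conversion could be written with strsplit and lapply in base R. The name subseq.to.TSE2 is made up for this illustration and is not part of TraMineR, and the sketch assumes that event names contain none of the characters "(", ")", "-" or ",".

```r
## Hypothetical vectorized variant of subseq.to.TSE (base R only).
## sdf: a data frame with at least the columns Id and Subsequence.
subseq.to.TSE2 <- function(sdf) {
  ## drop the parentheses, e.g. "(WT4)-(WT3)" becomes "WT4-WT3"
  s <- gsub("[()]", "", as.character(sdf$Subsequence))
  ## one character vector of transitions per subsequence
  trans <- strsplit(s, "-", fixed = TRUE)
  rows <- lapply(seq_along(trans), function(i) {
    ## simultaneous events within each transition
    ev <- strsplit(trans[[i]], ",", fixed = TRUE)
    data.frame(id = sdf$Id[i],
               event = unlist(ev),
               ## the transition position serves as timestamp
               time = rep(seq_along(ev), lengths(ev)),
               stringsAsFactors = FALSE)
  })
  tse <- do.call(rbind, rows)
  tse$event <- factor(tse$event)
  tse
}

## Small check on two of the example subsequences
ex <- data.frame(Id = c(16, 19),
                 Subsequence = c("(WT4)-(WT3)", "(FRF,WL2)"),
                 stringsAsFactors = FALSE)
ex.tse <- subseq.to.TSE2(ex)
print(ex.tse)
```

Here (WT4)-(WT3) gives two rows with times 1 and 2, while the simultaneous events in (FRF,WL2) both get time 1. Unlike the loop version, the event factor has its levels in alphabetical order.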
Here is how you would use it on the example data. We first create the data frame, which we name s.df:
s.df <- data.frame(scan(what=list(Id=0, Subsequence="", Support=double(), Count=0)))
16 (WT4)-(WT3) 0.76666667 805
17 (WL2) 0.76380952 802
18 (S1) 0.76000000 798
19 (FRF,WL2) 0.74380952 781
20 (WT2)-(WT1) 0.70571429 741
# leave a blank line to end the scan
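Since your data sit in a .txt file, you would probably read them with read.table rather than type them in. A minimal sketch, assuming the file looks exactly like the excerpt in the question (a header with three names, then rows with four fields): in that case read.table uses the first field of each row as row names, so the Id column has to be rebuilt from them. The text is inlined here only to keep the example self-contained.

```r
## Assumption: same layout as the excerpt shown in the question
txt <- "Subsequence Support Count
16 (WT4)-(WT3) 0.76666667 805
17 (WL2) 0.76380952 802
18 (S1) 0.76000000 798
19 (FRF,WL2) 0.74380952 781
20 (WT2)-(WT1) 0.70571429 741"

## With a real file, pass its path instead of text=txt
s.df <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

## The header has one field fewer than the data rows, so the first
## field ended up in the row names; turn it into the Id column
s.df$Id <- as.numeric(rownames(s.df))
rownames(s.df) <- NULL
print(s.df)
```

The columns come out in a different order than with scan, but that does not matter because subseq.to.TSE accesses them by name.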
Then we extract the TSE data from s.df and create from it the event sequence object using seqecreate. Finally, we assign the counts as sequence weights.
s.tse <- subseq.to.TSE(s.df)
seqe <- seqecreate(s.tse)
seqeweight(seqe) <- s.df[,"Count"]
Now you can, for instance, plot the event sequences with
seqpcplot(seqe)