Load frequent subsequences from TXT

Posted 2019-09-17 04:40

Question:

Is it possible to load a list of frequent subsequences from a .txt file, and make TraMineR recognize it as a sequence object?

Unfortunately I don't have the raw data, so I am not able to recreate the analysis. The only file I have is a .txt file containing the frequent subsequences. I assume it was created with the seqefsub() function from the TraMineR package, with maxGap=2, because the data looks like the output of that function.

read.table() reads it as a data frame, but as far as I understand, TraMineR handles event sequences as lists with many additional attributes that are not contained in this file. Or I don't know how to extract them...

This is what a couple of lines from the .txt file look like:

                                             Subsequence    Support  Count
16                                           (WT4)-(WT3) 0.76666667    805
17                                                 (WL2) 0.76380952    802
18                                                  (S1) 0.76000000    798
19                                             (FRF,WL2) 0.74380952    781
20                                           (WT2)-(WT1) 0.70571429    741

Answer 1:

To create an event sequence object from the (text) subsequences, you have to transform them into vertical time-stamped event (TSE) form. The function below does the job for your data:

## Function subseq.to.TSE
##  puts the sequences into TSE format using
##  position as timestamp
##  sdf: a data frame with columns Id, Subsequence, Support and Count.

subseq.to.TSE <- function(sdf){
  ## stringsAsFactors=FALSE keeps event as a character column in any R version
  tse <- data.frame(id=0, event="", time=0, stringsAsFactors=FALSE)
  k <- 0
  for (i in seq_len(nrow(sdf))){
    id <- sdf[i,"Id"]
    s <- as.character(sdf[i,"Subsequence"])
    ## remove the parentheses
    ss <- gsub("[()]","",s)
    ## split transitions
    st <- strsplit(ss, split="-")[[1]]
    for (j in seq_along(st)){
      ## split simultaneous events
      stt <- strsplit(st[j], split=",")[[1]]
      for (jj in seq_along(stt)){
        k <- k+1
        tse[k,"id"] <- id
        tse[k,"event"] <- stt[jj]
        ## position of the transition is used as timestamp
        tse[k,"time"] <- j
      }
    }
  }
  ## store the events as a factor
  tse$event <- factor(tse$event)
  return(tse)
}
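
To see what the TSE layout looks like, here is a minimal base-R sketch that parses a single subsequence string the same way as the function above. The helper name parse.one and the example string are made up for illustration; they are not part of TraMineR or of your data.

```r
## hypothetical helper: TSE rows for one subsequence string
parse.one <- function(id, s){
  ## drop the parentheses, then split on "-" to get the successive transitions
  st <- strsplit(gsub("[()]", "", s), split="-")[[1]]
  ## split each transition on "," to get its simultaneous events
  ev <- strsplit(st, split=",")
  data.frame(id=id,
             event=unlist(ev),
             ## every event of the j-th transition gets timestamp j
             time=rep(seq_along(ev), lengths(ev)),
             stringsAsFactors=FALSE)
}

## made-up subsequence: two simultaneous events followed by a single event
ex <- parse.one(1, "(FRF,WL2)-(S1)")
print(ex)
```

Simultaneous events share the same timestamp, and each "-" moves the timestamp one step forward.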

Here is how you would use it on the example data.

We first create the data frame that we name s.df

s.df <- data.frame(scan(what=list(Id=0, Subsequence="", Support=double(), Count=0)))
16 (WT4)-(WT3) 0.76666667    805
17 (WL2) 0.76380952    802
18 (S1) 0.76000000    798
19 (FRF,WL2) 0.74380952    781
20 (WT2)-(WT1) 0.70571429    741

# leave a blank line to end the scan
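
If you would rather read the .txt file directly than paste the lines into scan, something along the following lines may work. It is only a sketch, assuming the file has exactly the layout shown in the question (a header with three names and data lines with four fields). The lines from the question stand in for the file here, and the file name in the comment is hypothetical.

```r
## the lines shown in the question, standing in for the real file
txt <- "                                             Subsequence    Support  Count
16                                           (WT4)-(WT3) 0.76666667    805
17                                                 (WL2) 0.76380952    802
18                                                  (S1) 0.76000000    798
19                                             (FRF,WL2) 0.74380952    781
20                                           (WT2)-(WT1) 0.70571429    741
"

## for the real file, replace text=txt with the file name, e.g.
## read.table("subsequences.txt", header=TRUE, stringsAsFactors=FALSE)
raw <- read.table(text=txt, header=TRUE, stringsAsFactors=FALSE)

## the header has one field less than the data lines, so read.table
## uses the first column as row names; copy them into an Id column
s.df <- data.frame(Id=as.numeric(rownames(raw)), raw, stringsAsFactors=FALSE)
print(s.df)
```

The resulting s.df has the columns Id, Subsequence, Support and Count that subseq.to.TSE expects.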

Then we extract the TSE data from s.df and create from it the event sequence object using seqecreate. Finally, we assign the counts as sequence weights.

library(TraMineR)
s.tse <- subseq.to.TSE(s.df)
seqe <- seqecreate(s.tse)
seqeweight(seqe) <- s.df[,"Count"]

Now you can, for instance, plot the event sequences with:

seqpcplot(seqe)


Tags: r traminer