Load frequent subsequences from TXT

Is it possible to load a list of frequent subsequences from a .txt file, and make TraMineR recognize it as a sequence object?

Unfortunately, I don't have the raw data, so I am not able to recreate the analysis. The only file I have is a .txt file containing the frequent subsequences. I assume it was created with the seqefsub() function from the TraMineR package, with maxGap=2, because the data look like the output of that function.

read.table() reads it as a data frame but, as far as I understand, TraMineR handles event sequences as lists with many additional attributes that are not contained in this file. Or I don't know how to extract them...

This is what a couple of lines from the .txt file look like:

                                             Subsequence    Support  Count
16                                           (WT4)-(WT3) 0.76666667    805
17                                                 (WL2) 0.76380952    802
18                                                  (S1) 0.76000000    798
19                                             (FRF,WL2) 0.74380952    781
20                                           (WT2)-(WT1) 0.70571429    741

Tags: r traminer

1 answer

傲

Post #2 · 2019-09-17 05:37

To create an event sequence object from the (text) subsequences, you have to transform them into vertical time-stamped event (TSE) form, with one row per event. The function below does the job for your data:

## Function subseq.to.TSE
##  puts the sequences into TSE format using
##  position as timestamp
##  sdf: a data frame with columns Id, Subsequence, Support and Count.

subseq.to.TSE <- function(sdf){
  id <- numeric(0)
  event <- character(0)
  time <- numeric(0)
  for (i in seq_len(nrow(sdf))){
    ## strip the parentheses around each event set
    ss <- gsub("[()]", "", as.character(sdf[i,"Subsequence"]))
    ## split transitions
    st <- strsplit(ss, split="-", fixed=TRUE)[[1]]
    for (j in seq_along(st)){
      ## parsing for simultaneous events
      stt <- strsplit(st[j], split=",", fixed=TRUE)[[1]]
      id <- c(id, rep(sdf[i,"Id"], length(stt)))
      event <- c(event, stt)
      ## the position of the event set serves as timestamp
      time <- c(time, rep(j, length(stt)))
    }
  }
  ## event is returned as a factor, with levels in order of appearance
  data.frame(id=id, event=factor(event, levels=unique(event)), time=time)
}
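To make the target layout concrete, here is a minimal base R sketch of the TSE form for a single subsequence. The string `(FRF,WL2)-(S1)` and the name `tse.demo` are made up for illustration and do not come from the original file; TraMineR is not needed for this step.

```r
## Hypothetical subsequence, used only to illustrate the layout
s <- "(FRF,WL2)-(S1)"

## Drop the parentheses, split transitions on "-",
## then split simultaneous events on ","
parts <- strsplit(gsub("[()]", "", s), "-", fixed = TRUE)[[1]]
sets <- strsplit(parts, ",", fixed = TRUE)

## One row per event; events of the same set share a timestamp
tse.demo <- data.frame(id = 1,
                       event = unlist(sets),
                       time = rep(seq_along(sets), lengths(sets)))
print(tse.demo)
```

FRF and WL2 get timestamp 1 because they occur simultaneously, and S1 gets timestamp 2.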

Here is how you would use it on the example data.

We first create the data frame, which we name s.df:

s.df <- data.frame(scan(what=list(Id=0, Subsequence="", Support=double(), Count=0)))
16 (WT4)-(WT3) 0.76666667    805
17 (WL2) 0.76380952    802
18 (S1) 0.76000000    798
19 (FRF,WL2) 0.74380952    781
20 (WT2)-(WT1) 0.70571429    741

# leave a blank line to end the scan
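The scan() call above reads lines typed or pasted at the console. When the subsequences sit in a .txt file, as in the question, read.table() can build the same data frame. The following is a sketch that assumes the file has exactly the layout of the excerpt: a header line with three names, followed by rows whose first field is the unnamed row number. The variable `txt` stands in for the file content so that the example runs on its own; with a real file you would pass its path instead of `text = txt`.

```r
## Stand-in for the content of the .txt file (same lines as the excerpt)
txt <- "Subsequence    Support  Count
16 (WT4)-(WT3) 0.76666667    805
17 (WL2) 0.76380952    802
18 (S1) 0.76000000    798
19 (FRF,WL2) 0.74380952    781
20 (WT2)-(WT1) 0.70571429    741"

## The header has one field fewer than the data rows, so read.table()
## uses the first column as row names
raw <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

## Turn the row names into the Id column expected by subseq.to.TSE
s.df <- data.frame(Id = as.integer(rownames(raw)), raw, row.names = NULL)
print(s.df)
```

The resulting s.df has the columns Id, Subsequence, Support and Count, like the one produced with scan().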

Then we extract the TSE data from s.df and create from it the event sequence object using seqecreate. Finally, we assign the counts as sequence weights.

s.tse <- subseq.to.TSE(s.df)
seqe <- seqecreate(s.tse)
seqeweight(seqe) <- s.df[,"Count"]

Now you can, for instance, plot the event sequences with

seqpcplot(seqe)
