Complex long to wide data transformation (with tim

2019-05-30 10:58发布

问题:

I am currently working on a Multistate Analysis dataset in "long" form (one row for each individual's observation; each individual is repeatedly measured up to 5 times).

The idea is that each individual can recurrently transition across the levels of the time-varying state variable s = 1, 2, 3, 4. All the other variables that I have (here cohort) are fixed within any given id.

After some analyses, I need to reshape the dataset in "wide" form, according to the specific sequence of visited states. Here is an example of the initial long data:

  dat <- read.table(text = "

        id    cohort    s    
        1       1       2
        1       1       2
        1       1       1
        1       1       4
        2       3       1
        2       3       1
        2       3       3
        3       2       1
        3       2       2
        3       2       3
        3       2       3
        3       2       4", 

    header=TRUE)     

The final "wide" dataset should take into account the specific individual sequence of visited states, recorded into the newly created variables s1, s2, s3, s4, s5, where s1 is the first state visited by the individual and so on.

According to the above example, the wide dataset looks like:

    id    cohort    s1    s2    s3    s4    s5    
    1       1       2      2     1     4     0
    2       3       1      1     3     0     0
    3       2       1      2     3     3     4

I tried to use reshape(), and also to focus on transposing s, but without the intended result. Actually, my knowledge of the R functions is quite limited.. Can you give any suggestion? Thanks.

EDIT: obtaining a different kind of wide dataset

Thank you all for your help, I have a related question if I can. Especially when each individual is observed for a long time and there are few transitions across states, it is very useful to reshape the initial sample dat in this alternative way:

    id    cohort    s1    s2    s3    s4    s5    dur1  dur2  dur3  dur4  dur5 
    1       1       2      1     4     0     0      2     1     1     0     0  
    2       3       1      3     0     0     0      2     1     0     0     0
    3       2       1      2     3     4     0      1     1     2     1     0

In practice now s1-s5 are the distinct visited states, and dur1-dur5 the time spent in each respective distinct visited state.

Can you please give a hand for reaching this data structure? I believe it is necessary to create all the dur- and s- variables in an intermediate sample before using reshape(). Otherwise maybe it is possible to directly adopt -reshape2-?

回答1:

dat <- read.table(text = "
        id    cohort    s    
        1       1       2
        1       1       2
        1       1       1
        1       1       4
        2       3       1
        2       3       1
        2       3       3
        3       2       1
        3       2       2
        3       2       3
        3       2       3
        3       2       4", 
    header=TRUE)     

df <- data.frame(
    dat,
    period = sequence(rle(dat$id)$lengths) 
)

wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide")

wide[is.na(wide)] = 0
wide

Gives:

  id cohort s.1 s.2 s.3 s.4 s.5
1  1      1   2   2   1   4   0
5  2      3   1   1   3   0   0
8  3      2   1   2   3   3   4

then using the following line gives your names:

names(wide) <- c('id','cohort', paste('s', seq_along(1:5), sep=''))

#   id cohort s1 s2 s3 s4 s5
# 1  1      1  2  2  1  4  0
# 5  2      3  1  1  3  0  0
# 8  3      2  1  2  3  3  4

If you use sep='' in the wide statement you do not have to rename the variables:

wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide", sep='')

I suspect there are ways to avoid creating the period variable and avoid replacing NA directly in the wide statement, but I have not figured those out yet.



回答2:

ok...

library(plyr)
library(reshape2)

dat2 <- ddply(dat,.(id,cohort), function(x) 
       data.frame(s=x$s,name=paste0("s",seq_along(x$s))))


dat2 <- ddply(dat2,.(id,cohort), function(x) 
       dcast(x, id + cohort ~ name, value.var= "s" ,fill= 0)
       )

dat2[is.na(dat2)] <- 0

dat2

#    id cohort s1 s2 s3 s4 s5
#    1  1      1  2  2  1  4  0
#    2  2      3  1  1  3  0  0
#    3  3      2  1  2  3  3  4

This seem right? I admit the first ddply is hardly elegant.



回答3:

Try this:

library(reshape2)

dat$seq <- ave(dat$id, dat$id, FUN = function(x) paste0("s", seq_along(x)))
dat.s <- dcast(dat, id + cohort ~ seq, value.var = "s", fill = 0)

which gives this:

> dat.s
  id cohort s1 s2 s3 s4 s5
1  1      1  2  2  1  4  0
2  2      3  1  1  3  0  0
3  3      2  1  2  3  3  4

If you did not mind using just 1, 2, ..., 5 as column names then you could shorten the ave line to just:

dat$seq <- ave(dat$id, dat$id, FUN = seq_along)

Regarding the second question that was added later try this:

library(plyr)
dur.fn <- function(x) {
  r <- rle(x$s)$length
  data.frame(id = x$id[1], dur.value = r, dur.seq = paste0("dur", seq_along(r)))
}
dat.dur.long <- ddply(dat, .(id), dur.fn)
dat.dur <- dcast(dat.dur.long, id ~ dur.seq, c, value.var = "dur.value", fill = 0)
cbind(dat.s, dat.dur[-1])

which gives:

  id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4
1  1      1  2  2  1  4  0    2    1    1    0
2  2      3  1  1  3  0  0    2    1    0    0
3  3      2  1  2  3  3  4    1    1    2    1


标签: r reshape