How to create a variable that captures increases

Posted 2019-09-10 15:39

I have a dataset that looks something like this:

    Subject  Year   X   Y   
        A   1990    1   0   
        A   1991    1   0   
        A   1992    2   0   
        A   1993    3   1   
        A   1994    4   0   
        A   1995    4   0   
        B   1990    0   0   
        B   1991    1   0   
        B   1992    1   0   
        B   1993    2   1   
        C   1991    1   0   
        C   1992    2   0   
        C   1993    3   0   
        C   1994    3   0   
        D   1991    1   0   
        D   1992    2   0   
        D   1993    3   0   
        D   1994    4   0   
        D   1995    5   0   
        D   1996    5   1   
        D   1997    6   0   

How can I create two additional columns, where:

  • A1 is 1 if X increased and the maximum of X for the subject is at least 4; otherwise it is 0. I tried data$A1 <- as.numeric(data$X > 4), but that is not quite what I want.
  • A2 is harder to explain, and I have no idea how to do it in R. It follows the same idea as A1, in that it should still capture all X values greater than 3, but it should equal 1 only when Y = 0 for the following 5 years. The example below shows what A2 should look like. Is it possible to do this in R, or do I need to do it manually?

Result:

            Subject  Year   X   A1   Y   A2
                A   1990    1    1   0    0
                A   1991    1    0   0    0
                A   1992    2    1   0    0
                A   1993    3    1   1    0
                A   1994    4    1   0    0
                A   1995    4    0   0    0
                B   1990    0    0   0    0
                B   1991    1    0   0    0
                B   1992    1    0   0    0 
                B   1993    2    0   1    0
                C   1991    1    0   0    0
                C   1992    2    0   0    0 
                C   1993    3    0   0    0 
                C   1994    3    0   0    0
                D   1991    1    1   0    1
                D   1992    2    1   0    1
                D   1993    3    1   0    1
                D   1994    4    1   0    1 
                D   1995    5    1   0    1 
                D   1996    5    0   1    0
                D   1997    6    1   0    0

Rawdata without the variables A1 and A2:

> dput(data)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L, 
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L, 
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L, 
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L), Y = c(0L, 0L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L)), .Names = c("Subject", 
"Year", "X", "Y"), class = "data.frame", row.names = c(NA, -21L
))

2 answers
Fickle 薄情
Post #2 · 2019-09-10 16:40

We can do this with data.table:

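library(data.table)

# A1: per subject, if any X reaches 4, take 1 for the first row followed by
# the year-to-year change in X; otherwise 0.
setDT(data)[, A1 := if (any(X >= 4)) c(1, diff(X)) else 0, by = Subject]

# A2: per subject, if any X reaches 3, run-length encode Y == 0 and zero out
# the runs of zeros shorter than 5 rows; otherwise 0.
data[, A2 := if (any(X >= 3)) inverse.rle(within.list(rle(Y == 0),
              values[values][lengths[values] < 5] <- 0)) else 0, by = Subject]

data[, c("Subject", "Year", "X", "A1", "Y", "A2"), with = FALSE]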
#    Subject Year X A1 Y A2
# 1:       A 1990 1  1 0  0
# 2:       A 1991 1  0 0  0
# 3:       A 1992 2  1 0  0
# 4:       A 1993 3  1 1  0
# 5:       A 1994 4  1 0  0
# 6:       A 1995 4  0 0  0
# 7:       B 1990 0  0 0  0
# 8:       B 1991 1  0 0  0
# 9:       B 1992 1  0 0  0
#10:       B 1993 2  0 1  0
#11:       C 1991 1  0 0  0
#12:       C 1992 2  0 0  0
#13:       C 1993 3  0 0  0
#14:       C 1994 3  0 0  0
#15:       D 1991 1  1 0  1
#16:       D 1992 2  1 0  1
#17:       D 1993 3  1 0  1
#18:       D 1994 4  1 0  1
#19:       D 1995 5  1 0  1
#20:       D 1996 5  0 1  0
#21:       D 1997 6  1 0  0
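
To make the run-length step for A2 easier to follow, here is a minimal sketch that applies the same idea to subject D's Y values only. It is an unrolled illustration rather than the answer's exact expression: the names y, runs and a2 are made up for this example, and it should give the same result as the within.list() form on this data.

```r
# Y for subject D, copied from the question's data
y <- c(0, 0, 0, 0, 0, 1, 0)

# Run-length encode "Y == 0": consecutive rows with the same TRUE/FALSE value
# are collapsed into one run (here lengths 5, 1, 1 with values TRUE, FALSE, TRUE)
runs <- rle(y == 0)

# Keep only the runs of zeros that last at least 5 rows; everything else is 0
runs$values <- as.numeric(runs$values & runs$lengths >= 5)

# Expand the runs back to one value per row
a2 <- inverse.rle(runs)
a2
# [1] 1 1 1 1 1 0 0
```

The trailing zero in 1997 is dropped because its run has length 1, which matches the A2 column expected for subject D.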
Anthone
Post #3 · 2019-09-10 16:42

Does this do the job? Do you need Subject to be a factor? The code below does not yet handle the change of subject, e.g. from C to D.

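mydata <- data  # the data frame from the question's dput() output
mydata$max <- rep(FALSE, nrow(mydata))
mydata$A1 <- rep(0, nrow(mydata))
mydata$A2 <- rep(0, nrow(mydata))

# Flag every subject whose maximum X is at least 4, as the question asks
for (i in unique(mydata$Subject)) {
  subject_max <- max(mydata$X[mydata$Subject == i])
  if (subject_max >= 4) {
    mydata$max[mydata$Subject == i] <- TRUE
  }
}
# diff() runs over the whole column here, so it also crosses subject boundaries
mydata$A1 <- ifelse(mydata$max & c(FALSE, diff(mydata$X) > 0), 1, 0)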

A2 is still unclear (See also my edit). Hopefully this helps to get the rest done.
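
For the subject boundaries mentioned above, one option is to compute the difference within each subject using base R's ave(). The following is a minimal sketch, not the code from this answer: it rebuilds the question's data inline as df, the helper name a1_one_subject is made up for illustration, and treating each subject's first row as an increase is an assumption taken from the expected output in the question.

```r
# Rebuild the question's data (same values as the dput() output)
df <- data.frame(
  Subject = rep(c("A", "B", "C", "D"), times = c(6, 4, 4, 7)),
  Year = c(1990:1995, 1990:1993, 1991:1994, 1991:1997),
  X = c(1, 1, 2, 3, 4, 4, 0, 1, 1, 2, 1, 2, 3, 3, 1, 2, 3, 4, 5, 5, 6),
  Y = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)
)

# Hypothetical helper: A1 for the X values of a single subject.
# Rows where X rose get 1 (the first row is assumed to count as a rise),
# but only if the subject's maximum X is at least 4; otherwise all 0.
a1_one_subject <- function(x) {
  if (max(x) >= 4) as.numeric(c(1, diff(x)) > 0) else rep(0, length(x))
}

# ave() applies the helper to each subject's X separately, so diff()
# never compares the last row of one subject with the first row of the next
df$A1 <- ave(df$X, df$Subject, FUN = a1_one_subject)
df[, c("Subject", "Year", "X", "A1")]
```

Using `> 0` on the difference keeps A1 at 0 or 1 even if X jumps by more than one step or decreases.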
