How to create a variable that captures increases

Posted 2019-09-10 16:29

Question:

I have a dataset that looks something like this:

    Subject  Year   X   Y   
        A   1990    1   0   
        A   1991    1   0   
        A   1992    2   0   
        A   1993    3   1   
        A   1994    4   0   
        A   1995    4   0   
        B   1990    0   0   
        B   1991    1   0   
        B   1992    1   0   
        B   1993    2   1   
        C   1991    1   0   
        C   1992    2   0   
        C   1993    3   0   
        C   1994    3   0   
        D   1991    1   0   
        D   1992    2   0   
        D   1993    3   0   
        D   1994    4   0   
        D   1995    5   0   
        D   1996    5   1   
        D   1997    6   0   

How can I create two additional columns, A1 and A2, where:

  • A1 is 1 if X increased and the maximum of X for the subject is at least 4; otherwise it is 0. I tried data$A1 <- as.numeric(data$X > 4), but that is not quite what I want.
  • A2 is harder to explain, and I have no clue how to do it in R. It follows the same idea as A1, in that it should still capture the cases where X is more than 3, but it should equal 1 only when Y = 0 for the following 5 years. The example below shows what the A2 variable should look like. Is it possible to do this in R, or do I need to do it manually?

Desired result:

            Subject  Year   X   A1   Y   A2
                A   1990    1    1   0    0
                A   1991    1    0   0    0
                A   1992    2    1   0    0
                A   1993    3    1   1    0
                A   1994    4    1   0    0
                A   1995    4    0   0    0
                B   1990    0    0   0    0
                B   1991    1    0   0    0
                B   1992    1    0   0    0 
                B   1993    2    0   1    0
                C   1991    1    0   0    0
                C   1992    2    0   0    0 
                C   1993    3    0   0    0 
                C   1994    3    0   0    0
                D   1991    1    1   0    1
                D   1992    2    1   0    1
                D   1993    3    1   0    1
                D   1994    4    1   0    1 
                D   1995    5    1   0    1 
                D   1996    5    0   1    0
                D   1997    6    1   0    0

Raw data without the variables A1 and A2:

> dput(data)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L, 
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L, 
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L, 
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L), Y = c(0L, 0L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L)), .Names = c("Subject", 
"Year", "X", "Y"), class = "data.frame", row.names = c(NA, -21L
))

Answer 1:

We can do this with data.table:

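library(data.table)
# A1: for subjects whose X ever reaches 4, flag each row where X rose
# (the first row counts as a rise); all other subjects get 0
setDT(data)[, A1 := if (any(X >= 4)) c(1, diff(X) > 0) else 0, by = Subject]
# A2: run-length encode Y == 0 per subject, zero out the runs of zeros
# shorter than 5 rows, then expand back to one value per row
data[, A2 := if (any(X > 3)) inverse.rle(within.list(rle(Y == 0),
              values[values][lengths[values] < 5] <- 0)) else 0, by = Subject]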

data[, c("Subject", "Year", "X", "A1", "Y", "A2"), with = FALSE]
#    Subject Year X A1 Y A2
# 1:       A 1990 1  1 0  0
# 2:       A 1991 1  0 0  0
# 3:       A 1992 2  1 0  0
# 4:       A 1993 3  1 1  0
# 5:       A 1994 4  1 0  0
# 6:       A 1995 4  0 0  0
# 7:       B 1990 0  0 0  0
# 8:       B 1991 1  0 0  0
# 9:       B 1992 1  0 0  0
#10:       B 1993 2  0 1  0
#11:       C 1991 1  0 0  0
#12:       C 1992 2  0 0  0
#13:       C 1993 3  0 0  0
#14:       C 1994 3  0 0  0
#15:       D 1991 1  1 0  1
#16:       D 1992 2  1 0  1
#17:       D 1993 3  1 0  1
#18:       D 1994 4  1 0  1
#19:       D 1995 5  1 0  1
#20:       D 1996 5  0 1  0
#21:       D 1997 6  1 0  0
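
The run-length step in the A2 line is easier to follow on a single vector. The sketch below is only an illustration, not part of the answer's code: it uses subject D's Y values copied from the question, assumes one row per consecutive year, and reads "Y = 0 for the following 5 years" as a run of at least five zero rows. The names y, runs and a2 are made up for the example.

```r
# Y values for subject D, copied from the question's data
y <- c(0L, 0L, 0L, 0L, 0L, 1L, 0L)

# Encode the vector as runs of "Y is zero" / "Y is not zero"
runs <- rle(y == 0)
runs$lengths   # 5 1 1
runs$values    # TRUE FALSE TRUE

# Keep a run only if it is a run of zeros lasting at least 5 rows
runs$values <- as.integer(runs$values & runs$lengths >= 5)

# Expand the runs back to one value per row
a2 <- inverse.rle(runs)
a2             # 1 1 1 1 1 0 0
```

The trailing single zero in 1997 is dropped because its run is only one row long, which matches the A2 column shown for subject D.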


Answer 2:

Does this do the job? Do you need Subject as a factor? The code below does not yet account for the change of subject, e.g. from C to D.

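mydata <- data  # 'data' as built from the dput() output in the question
mydata$max <- rep(FALSE, nrow(mydata))
mydata$A1 <- rep(0, nrow(mydata))
mydata$A2 <- rep(0, nrow(mydata))

# Flag every subject whose maximum X is at least 4
for (i in unique(mydata$Subject)) {
  subj_max <- max(mydata$X[mydata$Subject == i])
  if (subj_max >= 4) {
    mydata$max[mydata$Subject == i] <- TRUE
  }
}
# diff() runs over the whole column, so subject boundaries are ignored here
mydata$A1 <- ifelse(mydata$max & c(FALSE, diff(mydata$X) > 0), 1, 0)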

A2 is still unclear to me (see also my edit). Hopefully this helps you get the rest done.
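
One way to respect the subject boundaries in base R is ave(), which applies a function within each group. The following is a hedged sketch of one reading of both rules, not the answerer's method. It rebuilds the question's data by hand (without factors), treats the first row of each subject as an increase because the desired result does, and reads "Y = 0 for the following 5 years" as a run of at least five consecutive zero rows. The names subj_max, increased and zero_run are made up for the example.

```r
# The data from the question, rebuilt by hand
mydata <- data.frame(
  Subject = rep(c("A", "B", "C", "D"), times = c(6, 4, 4, 7)),
  Year = c(1990:1995, 1990:1993, 1991:1994, 1991:1997),
  X = c(1, 1, 2, 3, 4, 4, 0, 1, 1, 2, 1, 2, 3, 3, 1, 2, 3, 4, 5, 5, 6),
  Y = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)
)

# Per-subject maximum of X, repeated on every row of that subject
subj_max <- ave(mydata$X, mydata$Subject, FUN = max)

# Per-subject increase flag; the first row of each subject counts as an
# increase (assumption taken from the desired result in the question)
increased <- ave(mydata$X, mydata$Subject,
                 FUN = function(x) c(1, diff(x) > 0))

# Per-subject flag for rows inside a run of at least 5 zeros in Y
zero_run <- ave(mydata$Y, mydata$Subject, FUN = function(y) {
  r <- rle(y == 0)
  r$values <- r$values & r$lengths >= 5
  inverse.rle(r)
})

mydata$A1 <- as.integer(subj_max >= 4 & increased == 1)
mydata$A2 <- as.integer(subj_max >= 4 & zero_run == 1)
mydata[, c("Subject", "Year", "X", "A1", "Y", "A2")]
```

Because every step is computed per subject, the first row of D is no longer compared with the last row of C.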