How to create a “conditional” variable in R?

Posted 2020-04-21 07:21

Question:

I want to create a conditional dummy variable. Assume that I have a dataset that looks something like this:

Subject Year    X   X1
   A    1990    1   0
   A    1991    1   0
   A    1992    2   0
   A    1993    3   0
   A    1994    4   0
   A    1995    4   1
   B    1990    0   0
   B    1991    1   0
   B    1992    1   0
   B    1993    2   0
   B    1994    3   0
   C    1990    1   0
   C    1991    2   0
   C    1992    3   1
   C    1993    3   0
   D    1990    1   0
   D    1991    2   0
   D    1992    3   0
   D    1993    4   1
   D    1994    5   0
   E    1990    1   0
   E    1991    1   0
   E    1992    2   1
   E    1993    3   0

Let's call this conditional variable Q1to3_noX1. Another variable of interest is Q1to3.

The Q1to3 variable is also a dummy variable: for each Subject it is 1 when X has reached the value 3, and 0 otherwise. If X reaches 4 or more, Q1to3 should be 0. X is a cumulative variable (0, 1, 2, 3, 4, ...). In other words, Q1to3 is 1 if the subject's maximum X value is 3.

I created this variable using: data$Q1to3 <- ave(data$X, data$Subject, FUN = function(x) if (max(x) == 3) 1 else 0) (thanks to @Zelazny7).
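For illustration, here is the same `ave` call on a small made-up data frame. The name `toy` and its values are hypothetical and not part of the data above. `ave` evaluates the function once per Subject and repeats the result on every row of that subject.

```r
# Hypothetical toy data: two subjects with a cumulative X.
toy <- data.frame(
  Subject = c("A", "A", "A", "B", "B"),
  X       = c(1L, 3L, 4L, 1L, 3L)
)

# 1 on every row of a subject whose maximum X is exactly 3, else 0.
toy$Q1to3 <- ave(toy$X, toy$Subject,
                 FUN = function(x) if (max(x) == 3) 1 else 0)

toy$Q1to3
# A peaks at 4, so its rows get 0; B peaks at 3, so its rows get 1.
```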

The Q1to3_noX1 variable is very similar to Q1to3 but, in contrast to Q1to3, it is conditional on the X1 variable. To be more precise, if X1 = 1 in the following 5 years (counting from the first year of Q1to3), Q1to3_noX1 should be 0. In other words, Q1to3_noX1 should be 1 if a) the maximum X value is 3 and b) X1 = 0 in the following 5 years (otherwise 0).

I understand from this question that I should use the rle function. However, I haven't been able to apply it in this particular case. Do you have any suggestions?

The desirable outcome should look like this:

Subject Year    X   X1  Q1to3   Q1to3_noX1
   A    1990    1   0   0          0
   A    1991    1   0   0          0
   A    1992    2   0   0          0
   A    1993    3   0   0          0
   A    1994    4   0   0          0
   A    1995    4   1   0          0
   B    1990    0   0   1          0
   B    1991    1   0   1          1
   B    1992    1   0   1          1
   B    1993    2   0   1          1
   B    1994    3   0   1          1
   C    1990    1   0   1          0
   C    1991    2   0   1          0
   C    1992    3   1   1          0
   C    1993    3   0   1          0
   D    1990    1   0   0          0
   D    1991    2   0   0          0
   D    1992    3   0   0          0
   D    1993    4   1   0          0
   D    1994    5   0   0          0
   E    1990    1   0   1          0
   E    1991    1   0   1          0
   E    1992    2   1   1          0
   E    1993    3   0   1          0

A reproducible sample:

    > dput(data)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 
5L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
    Year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1990L, 
    1991L, 1992L, 1993L, 1994L, 1990L, 1991L, 1992L, 1993L, 1990L, 
    1991L, 1992L, 1993L, 1994L, 1990L, 1991L, 1992L, 1993L), 
    X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 3L, 1L, 2L, 
    3L, 3L, 1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L, 3L), X1 = c(0L, 0L, 
    0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
    0L, 1L, 0L, 0L, 0L, 1L, 0L), Q1to3 = c(0L, 0L, 0L, 0L, 0L, 
    0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
    1L, 1L, 1L, 1L), Q1to3_noX1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L)), .Names = c("Subject", "Year", "X", "X1", "Q1to3", 
"Q1to3_noX1"), class = "data.frame", row.names = c(NA, -24L))

Answer 1:

How about this?

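# For each row, count the X1 events in the 5-year window starting at that row.
# Assumes data is sorted by Subject, so the concatenated result lines up with the rows.
data$cX1 <- do.call("c", tapply(data$X1, data$Subject, FUN = function(x) {
  nx <- length(x)
  sx <- numeric(nx)
  if (nx < 5) {
    # fewer than 5 years observed: use the subject's total for every row
    sx[] <- sum(x)
  } else {
    for (i in seq_len(nx)) sx[i] <- sum(x[i:min(i + 5 - 1, nx)])
  }
  sx
}, simplify = TRUE))

# 1 only where Q1to3 is 1 and no X1 event falls inside the window
data$Q1to3_noX1f2 <- ifelse(data$Q1to3 == 1 & data$cX1 == 0, 1, 0)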
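To make the window logic easier to see, here is a minimal sketch of the forward-looking sum on its own. The helper name `forward_sum` is hypothetical. It always truncates the window at the end of the vector, so unlike the code above it does not fall back to the subject's total when fewer than 5 years are observed.

```r
# Hypothetical helper: for each position i, sum x[i], ..., x[i + k - 1],
# truncating the window at the end of the vector.
forward_sum <- function(x, k = 5) {
  n <- length(x)
  vapply(seq_len(n), function(i) sum(x[i:min(i + k - 1, n)]), numeric(1))
}

# Subject A's X1 values from the question (1990 to 1995).
x1_A <- c(0, 0, 0, 0, 0, 1)
forward_sum(x1_A)
# 0 1 1 1 1 1: only the window starting in 1990 misses the 1995 event.
```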


Answer 2:

Here's another example using base R. I'm not 100% sure I understand the exact details of the question, but this pattern should solve your problem.

ave is great for broadcasting a summarized vector back to the original dimensions of the data. But if you look at the function body for ave it is just using split under the hood. We can do the same and create multiple columns per chunk instead of just one:
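As a rough check of that claim, the sketch below computes a per-group maximum once with `ave` and once by hand with `split`, `lapply` and `unsplit`. The vectors `x` and `g` are hypothetical toy values, not the question's data.

```r
# Hypothetical toy vectors.
x <- c(1, 3, 4, 1, 3)
g <- factor(c("A", "A", "A", "B", "B"))

# ave: per-group maximum, repeated for each row of the group.
via_ave <- ave(x, g, FUN = max)

# The same by hand: split, apply, recycle to the group size, unsplit.
pieces    <- split(x, g)
via_split <- unsplit(lapply(pieces, function(p) rep(max(p), length(p))), g)

# Both approaches should agree element by element.
all.equal(via_ave, via_split)
```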

# split the data.frame (df here stands for the question's data)
s <- split(df, df$Subject)

## calculate both columns at once per subject
both <- lapply(s, function(chunk) {
  Q1to3 <- if (max(chunk$X) == 3) 1 else 0
  Q1to3_noX1 <- if (Q1to3 == 1 && all(chunk$X1 == 0)) 1 else 0
  data.frame(Q1to3, Q1to3_noX1)
})

## cbind them back together and unsplit
out <- unsplit(Map(cbind, s, both), df$Subject)
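A small usage sketch of this pattern, run on a hypothetical two-subject data frame rather than the question's data. Each chunk produces a one-row data frame, and `cbind` recycles that row across all rows of the chunk.

```r
# Hypothetical toy data frame standing in for the question's data.
df <- data.frame(
  Subject = factor(c("B", "B", "C", "C")),
  X       = c(2L, 3L, 3L, 3L),
  X1      = c(0L, 0L, 1L, 0L)
)

s <- split(df, df$Subject)
both <- lapply(s, function(chunk) {
  Q1to3 <- if (max(chunk$X) == 3) 1 else 0
  Q1to3_noX1 <- if (Q1to3 == 1 && all(chunk$X1 == 0)) 1 else 0
  data.frame(Q1to3, Q1to3_noX1)
})
out <- unsplit(Map(cbind, s, both), df$Subject)

out$Q1to3_noX1
# B has no X1 event, so its rows get 1; C has one, so its rows get 0.
```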