Mutating multiple columns dynamically while condit

2020-06-12 03:19发布

问题:

I know there are several similar questions around here, but none of them seems to address the precise issue I'm having.

set.seed(4)
df = data.frame(
  Key = c("A", "B", "A", "D", "A"),
  Val1 = rnorm(5),
  Val2 = runif(5),
  Val3 = 1:5
)

I want to zeroise values of the value columns for the rows where Key == "A" The column names are referenced through a grep:

cols = grep("Val", names(df), value = TRUE)

Normally to achieve what I want in this case I would use data.table like this:

library(data.table)
df = as.data.table(df)
df[Key == "A", (cols) := 0]

And the desired output is like this:

  Key      Val1       Val2 Val3
1   A  0.000000 0.00000000    0
2   B -1.383814 0.55925762    2
3   A  0.000000 0.00000000    0
4   D  1.437151 0.05632773    4
5   A  0.000000 0.00000000    0

However this time I need to use dplyr as I am working on a team project where everyone uses it. The data I just provided is illustrative and my real data is >5m rows with 16 value columns to be updated. The only solution I could come up with is using mutate_at like this:

df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))

However, this seems to be extremely slow on my real data. I was hoping to find a solution which is more elegant and, more importantly, faster.

I have tried many combinations using map, unquoting using !!, using get and := (which annoyingly can get masked by the := in data.table) etc, but I think my understanding of how these work is simply not deep enough to construct a valid solution.

回答1:

With this dplyr command,

df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))

You are actually evaluating the statement df$Key == "A", n times, where n=the number of columns you have.

One work around is to pre-define the rows you want to change:

idx = which(DF$Key=="A")
DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})

A cleaner and better way, correctly pointed out by @IceCreamToucan (see comments below), is to use the function replace, while passing it the extra parameters:

DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0)

We can put all these approaches to test, and I think dplyr and data.table are comparable.

#simulate data
set.seed(100)
Key = sample(LETTERS[1:3],1000000,replace=TRUE)
DF = as.data.frame(data.frame(Key,matrix(runif(1000000*10),nrow=1000000,ncol=10)))
DT = as.data.table(DF)

cols = grep("[35789]", names(DF), value = TRUE)

#long method
system.time(DF %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(DF$Key == "A", 0, x)))
user  system elapsed 
  0.121   0.035   0.156 

#old base R way
system.time(DF[idx,cols] <- 0)
   user  system elapsed 
  0.085   0.021   0.106 

#dplyr
# define function
func = function(){
       idx = which(DF$Key=="A")
       DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})
}
system.time(func())
user  system elapsed 
  0.020   0.006   0.026

#data.table
system.time(DT[Key=="A", (cols) := 0])
   user  system elapsed 
  0.012   0.001   0.013 
#replace with dplyr
system.time(DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0))
user  system elapsed 
  0.007   0.001   0.008