可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:
library(data.table)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
space = sample(1:4, 50, replace=T),
measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
replace=T),
qty = round(runif(50) * 30),
qty.exit = 0,
delta.watts = sample(10.5:100.5, 50, replace=T),
cf = runif(50))
# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit',
`:=`(qty.exit = qty,
cf = 0,
delta.watts = 13)]
Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.
Thanks in advance for the help!
回答1:
These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:
1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate
but only acts on the rows satisfying the condition:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
condition <- eval(substitute(condition), .data, envir)
.data[condition, ] <- .data[condition, ] %>% mutate(...)
.data
}
DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate
but is only used within group_by
(as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by
specifies a condition then mutate_last
will only operate on rows satisfying that condition.
mutate_last <- function(.data, ...) {
n <- n_groups(.data)
indices <- attr(.data, "indices")[[n]] + 1
.data[indices, ] <- .data[indices, ] %>% mutate(...)
.data
}
DF %>%
group_by(is.exit = measure == 'exit') %>%
mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
ungroup() %>%
select(-is.exit)
2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse
, replace
or arithmetic with logicals as illustrated. This also works for data tables.
library(dplyr)
DF %>% mutate(is.exit = measure == 'exit',
qty.exit = ifelse(is.exit, qty, qty.exit),
cf = (!is.exit) * cf,
delta.watts = replace(delta.watts, is.exit, 13)) %>%
select(-is.exit)
3) sqldf We could use SQL update
via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update
but in fact the update
is acting on a copy of the input in the temporarily generated database and not on the actual input.
library(sqldf)
DF %>%
do(sqldf(c("update '.'
set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
where measure = 'exit'",
"select * from '.'")))
Note 1: We used this as DF
set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
space = sample(1:4, 50, replace=T),
measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
replace=T),
qty = round(runif(50) * 30),
qty.exit = 0,
delta.watts = sample(10.5:100.5, 50, replace=T),
cf = runif(50))
Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.
回答2:
You can do this with magrittr
's two-way pipe %<>%
:
library(dplyr)
library(magrittr)
dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
cf = 0,
delta.watts = 13)
This reduces the amount of typing, but is still much slower than data.table
.
回答3:
Here's a solution I like:
mutate_when <- function(data, ...) {
dots <- eval(substitute(alist(...)))
for (i in seq(1, length(dots), by = 2)) {
condition <- eval(dots[[i]], envir = data)
mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
data[condition, names(mutations)] <- mutations
}
data
}
It lets you write things like e.g.
mtcars %>% mutate_when(
mpg > 22, list(cyl = 100),
disp == 160, list(cyl = 200)
)
which is quite readable -- although it may not be as performant as it could be.
回答4:
As eipi10 shows above, there's not a simple way to do a subset replacement in dplyr because DT uses pass-by-reference semantics vs dplyr using pass-by-value. dplyr requires the use of ifelse()
on the whole vector, whereas DT will do the subset and update by reference (returning the whole DT). So, for this exercise, DT will be substantially faster.
You could alternatively subset first, then update, and finally recombine:
dt.sub <- dt[dt$measure == "exit",] %>%
mutate(qty.exit= qty, cf= 0, delta.watts= 13)
dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])
But DT is gonna be substantially faster:
(editted to use eipi10's new answer)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt= {dt <- dt[measure == 'exit',
`:=`(qty.exit = qty,
cf = 0,
delta.watts = 13)]},
eipi10= {dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
cf = 0,
delta.watts = 13)},
alex= {dt.sub <- dt[dt$measure == "exit",] %>%
mutate(qty.exit= qty, cf= 0, delta.watts= 13)
dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])})
Unit: microseconds
expr min lq mean median uq max neval cld
dt 591.480 672.2565 747.0771 743.341 780.973 1837.539 100 a
eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509 100 b
alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427 100 b
回答5:
I just stumbled across this and really like mutate_cond()
by @G. Grothendieck, but thought it might come in handy to also handle new variables. So, below has two additions:
Unrelated: Second last line made a bit more dplyr
by using filter()
Three new lines at the beginning get variable names for use in mutate()
, and initializes any new variables in the data frame before mutate()
occurs. New variables are initialized for the remainder of the data.frame
using new_init
, which is set to missing (NA
) as a default.
mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
# Initialize any new variables as new_init
new_vars <- substitute(list(...))[-1]
new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
.data[, new_vars] <- new_init
condition <- eval(substitute(condition), .data, envir)
.data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
.data
}
Here are some examples using the iris data:
Change Petal.Length
to 88 where Species == "setosa"
. This will work in the original function as well as this new version.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)
Same as above, but also create a new variable x
(NA
in rows not included in the condition). Not possible before.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)
Same as above, but rows not included in the condition for x
are set to FALSE.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)
This example shows how new_init
can be set to a list
to initialize multiple new variables with different values. Here, two new variables are created with excluded rows being initialized using different values (x
initialised as FALSE
, y
as NA
)
iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
x = TRUE, y = Sepal.Length ^ 2,
new_init = list(FALSE, NA))
回答6:
mutate_cond is a great function, but it gives an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows when the condition is TRUE, but omits both rows with FALSE and NA.
With this small change the function works like a charm:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
condition <- eval(substitute(condition), .data, envir)
condition[is.na(condition)] = FALSE
.data[condition, ] <- .data[condition, ] %>% mutate(...)
.data
}
回答7:
I don't actually see any changes to dplyr
that would make this much easier. case_when
is great for when there are multiple different conditions and outcomes for one column but it doesn't help for this case where you want to change multiple columns based on one condition. Similarly, recode
saves typing if you are replacing multiple different values in one column but doesn't help with doing so in multiple columns at once. Finally, mutate_at
etc. only apply conditions to the column names not the rows in the dataframe. You could potentially write a function for mutate_at that would do it but I can't figure out how you would make it behave differently for different columns.
That said here is how I would approach it using nest
form tidyr
and map
from purrr
.
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
space = sample(1:4, 50, replace=T),
measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
replace=T),
qty = round(runif(50) * 30),
qty.exit = 0,
delta.watts = sample(10.5:100.5, 50, replace=T),
cf = runif(50))
dt2 <- dt %>%
nest(-measure) %>%
mutate(data = if_else(
measure == "exit",
map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
data
)) %>%
unnest()
回答8:
With the creation of rlang
, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir
argument, as enquo()
captures the environment that .p
is created in automatically.
mutate_rows <- function(.data, .p, ...) {
.p <- rlang::enquo(.p)
.p_lgl <- rlang::eval_tidy(.p, .data)
.data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
.data
}
dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)
回答9:
I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table
-solution..
Use base::replace()
df %>% mutate( qty.exit = replace( qty.exit, measure == 'exit', qty[ measure == 'exit'] ),
cf = replace( cf, measure == 'exit', 0 ),
delta.watts = replace( delta.watts, measure == 'exit', 13 ) )
replace recycles the replacement value, so when you want the values of columns qty
entered into colums qty.exit
, you have to subset qty
as well... hence the qty[ measure == 'exit']
in the first replacement..
now, you will probably not want to retype the measure == 'exit'
all the time... so you van create an index-vector containing that selection, and use it in the functions above..
#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )
df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
cf = replace( cf, index.v, 0 ),
delta.watts = replace( delta.watts, index.v, 13 ) )
benchmarks
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.005018 1.053370 1.137456 1.112871 1.186228 1.690996 100
# wimpel 1.061052 1.079128 1.218183 1.105037 1.137272 7.390613 100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995 100
回答10:
One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:
library(dplyr)
dt %>%
filter(measure == 'exit') %>%
mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
rbind(dt %>% filter(measure != 'exit'))
回答11:
You could split the dataset and do a regular mutate call on the TRUE
part.
dplyr 0.8 features the function group_split
which splits by groups (and groups can be defined directly in the call) so we'll use it here, but base::split
works as well.
library(tidyverse)
df1 %>%
group_split(measure == "exit", keep=FALSE) %>% # or `split(.$measure == "exit")`
modify_at(2,~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%
bind_rows()
# site space measure qty qty.exit delta.watts cf
# 1 1 4 led 1 0 73.5 0.246240409
# 2 2 3 cfl 25 0 56.5 0.360315879
# 3 5 4 cfl 3 0 38.5 0.279966850
# 4 5 3 linear 19 0 40.5 0.281439486
# 5 2 3 linear 18 0 82.5 0.007898384
# 6 5 1 linear 29 0 33.5 0.392412729
# 7 5 3 linear 6 0 46.5 0.970848817
# 8 4 1 led 10 0 89.5 0.404447182
# 9 4 1 led 18 0 96.5 0.115594622
# 10 6 3 linear 18 0 15.5 0.017919745
# 11 4 3 led 22 0 54.5 0.901829577
# 12 3 3 led 17 0 79.5 0.063949974
# 13 1 3 led 16 0 86.5 0.551321441
# 14 6 4 cfl 5 0 65.5 0.256845013
# 15 4 2 led 12 0 29.5 0.340603733
# 16 5 3 linear 27 0 63.5 0.895166931
# 17 1 4 led 0 0 47.5 0.173088800
# 18 5 3 linear 20 0 89.5 0.438504370
# 19 2 4 cfl 18 0 45.5 0.031725246
# 20 2 3 led 24 0 94.5 0.456653397
# 21 3 3 cfl 24 0 73.5 0.161274319
# 22 5 3 led 9 0 62.5 0.252212124
# 23 5 1 led 15 0 40.5 0.115608182
# 24 3 3 cfl 3 0 89.5 0.066147321
# 25 6 4 cfl 2 0 35.5 0.007888337
# 26 5 1 linear 7 0 51.5 0.835458916
# 27 2 3 linear 28 0 36.5 0.691483644
# 28 5 4 led 6 0 43.5 0.604847889
# 29 6 1 linear 12 0 59.5 0.918838163
# 30 3 3 linear 7 0 73.5 0.471644760
# 31 4 2 led 5 0 34.5 0.972078100
# 32 1 3 cfl 17 0 80.5 0.457241602
# 33 5 4 linear 3 0 16.5 0.492500255
# 34 3 2 cfl 12 0 44.5 0.804236607
# 35 2 2 cfl 21 0 50.5 0.845094268
# 36 3 2 linear 10 0 23.5 0.637194873
# 37 4 3 led 6 0 69.5 0.161431896
# 38 3 2 exit 19 19 13.0 0.000000000
# 39 6 3 exit 7 7 13.0 0.000000000
# 40 6 2 exit 20 20 13.0 0.000000000
# 41 3 2 exit 1 1 13.0 0.000000000
# 42 2 4 exit 19 19 13.0 0.000000000
# 43 3 1 exit 24 24 13.0 0.000000000
# 44 3 3 exit 16 16 13.0 0.000000000
# 45 5 3 exit 9 9 13.0 0.000000000
# 46 2 3 exit 6 6 13.0 0.000000000
# 47 4 1 exit 1 1 13.0 0.000000000
# 48 1 1 exit 14 14 13.0 0.000000000
# 49 6 3 exit 7 7 13.0 0.000000000
# 50 2 4 exit 3 3 13.0 0.000000000
If row order matters, use tibble::rowid_to_column
first, then dplyr::arrange
on rowid
and select it out in the end.
data
df1 <- data.frame(site = sample(1:6, 50, replace=T),
space = sample(1:4, 50, replace=T),
measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
replace=T),
qty = round(runif(50) * 30),
qty.exit = 0,
delta.watts = sample(10.5:100.5, 50, replace=T),
cf = runif(50),
stringsAsFactors = F)
回答12:
At the expense of breaking with the usual dplyr syntax, you can use within
from base:
dt %>% within(qty.exit[measure == 'exit'] <- qty[measure == 'exit'],
delta.watts[measure == 'exit'] <- 13)
It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.