dplyr mutate/replace several columns on a subset o

2020-01-24 01:55发布

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.

Thanks in advance for the help!

12条回答
Deceive 欺骗
2楼-- · 2020-01-24 02:28

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

查看更多
萌系小妹纸
3楼-- · 2020-01-24 02:28

You can do this with magrittr's two-way pipe %<>%:

library(dplyr)
library(magrittr)

dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                    cf = 0,  
                                    delta.watts = 13)

This reduces the amount of typing, but is still much slower than data.table.

查看更多
一夜七次
4楼-- · 2020-01-24 02:29

I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table-solution..

Use base::replace()

df %>% mutate( qty.exit = replace( qty.exit, measure == 'exit', qty[ measure == 'exit'] ),
                          cf = replace( cf, measure == 'exit', 0 ),
                          delta.watts = replace( delta.watts, measure == 'exit', 13 ) )

replace recycles the replacement value, so when you want the values of columns qty entered into colums qty.exit, you have to subset qty as well... hence the qty[ measure == 'exit'] in the first replacement..

now, you will probably not want to retype the measure == 'exit' all the time... so you van create an index-vector containing that selection, and use it in the functions above..

#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )

df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
               cf = replace( cf, index.v, 0 ),
               delta.watts = replace( delta.watts, index.v, 13 ) )

benchmarks

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100
查看更多
做自己的国王
5楼-- · 2020-01-24 02:35

You could split the dataset and do a regular mutate call on the TRUE part.

dplyr 0.8 features the function group_split which splits by groups (and groups can be defined directly in the call) so we'll use it here, but base::split works as well.

library(tidyverse)
df1 %>%
  group_split(measure == "exit", keep=FALSE) %>% # or `split(.$measure == "exit")`
  modify_at(2,~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()

#    site space measure qty qty.exit delta.watts          cf
# 1     1     4     led   1        0        73.5 0.246240409
# 2     2     3     cfl  25        0        56.5 0.360315879
# 3     5     4     cfl   3        0        38.5 0.279966850
# 4     5     3  linear  19        0        40.5 0.281439486
# 5     2     3  linear  18        0        82.5 0.007898384
# 6     5     1  linear  29        0        33.5 0.392412729
# 7     5     3  linear   6        0        46.5 0.970848817
# 8     4     1     led  10        0        89.5 0.404447182
# 9     4     1     led  18        0        96.5 0.115594622
# 10    6     3  linear  18        0        15.5 0.017919745
# 11    4     3     led  22        0        54.5 0.901829577
# 12    3     3     led  17        0        79.5 0.063949974
# 13    1     3     led  16        0        86.5 0.551321441
# 14    6     4     cfl   5        0        65.5 0.256845013
# 15    4     2     led  12        0        29.5 0.340603733
# 16    5     3  linear  27        0        63.5 0.895166931
# 17    1     4     led   0        0        47.5 0.173088800
# 18    5     3  linear  20        0        89.5 0.438504370
# 19    2     4     cfl  18        0        45.5 0.031725246
# 20    2     3     led  24        0        94.5 0.456653397
# 21    3     3     cfl  24        0        73.5 0.161274319
# 22    5     3     led   9        0        62.5 0.252212124
# 23    5     1     led  15        0        40.5 0.115608182
# 24    3     3     cfl   3        0        89.5 0.066147321
# 25    6     4     cfl   2        0        35.5 0.007888337
# 26    5     1  linear   7        0        51.5 0.835458916
# 27    2     3  linear  28        0        36.5 0.691483644
# 28    5     4     led   6        0        43.5 0.604847889
# 29    6     1  linear  12        0        59.5 0.918838163
# 30    3     3  linear   7        0        73.5 0.471644760
# 31    4     2     led   5        0        34.5 0.972078100
# 32    1     3     cfl  17        0        80.5 0.457241602
# 33    5     4  linear   3        0        16.5 0.492500255
# 34    3     2     cfl  12        0        44.5 0.804236607
# 35    2     2     cfl  21        0        50.5 0.845094268
# 36    3     2  linear  10        0        23.5 0.637194873
# 37    4     3     led   6        0        69.5 0.161431896
# 38    3     2    exit  19       19        13.0 0.000000000
# 39    6     3    exit   7        7        13.0 0.000000000
# 40    6     2    exit  20       20        13.0 0.000000000
# 41    3     2    exit   1        1        13.0 0.000000000
# 42    2     4    exit  19       19        13.0 0.000000000
# 43    3     1    exit  24       24        13.0 0.000000000
# 44    3     3    exit  16       16        13.0 0.000000000
# 45    5     3    exit   9        9        13.0 0.000000000
# 46    2     3    exit   6        6        13.0 0.000000000
# 47    4     1    exit   1        1        13.0 0.000000000
# 48    1     1    exit  14       14        13.0 0.000000000
# 49    6     3    exit   7        7        13.0 0.000000000
# 50    2     4    exit   3        3        13.0 0.000000000

If row order matters, use tibble::rowid_to_column first, then dplyr::arrange on rowid and select it out in the end.

data

df1 <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50),
                 stringsAsFactors = F)
查看更多
男人必须洒脱
6楼-- · 2020-01-24 02:37

I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might come in handy to also handle new variables. So, below has two additions:

Unrelated: Second last line made a bit more dplyr by using filter()

Three new lines at the beginning get variable names for use in mutate(), and initializes any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data.frame using new_init, which is set to missing (NA) as a default.

mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Here are some examples using the iris data:

Change Petal.Length to 88 where Species == "setosa". This will work in the original function as well as this new version.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Same as above, but also create a new variable x (NA in rows not included in the condition). Not possible before.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition for x are set to FALSE.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created with excluded rows being initialized using different values (x initialised as FALSE, y as NA)

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))
查看更多
叼着烟拽天下
7楼-- · 2020-01-24 02:38

I don't actually see any changes to dplyr that would make this much easier. case_when is great for when there are multiple different conditions and outcomes for one column but it doesn't help for this case where you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column but doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names not the rows in the dataframe. You could potentially write a function for mutate_at that would do it but I can't figure out how you would make it behave differently for different columns.

That said here is how I would approach it using nest form tidyr and map from purrr.

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()
查看更多
登录 后发表回答