Looping across multiple variables and parameters u

2019-02-28 22:43发布

问题:

I'm having trouble figuring out how to effective map across multiple parameters and variables within a tbl to generate new variables.

In the "real" version, I basically have one mathematical function generating a central estimate, and I need to run a whole series of sensitivity tests varying different parameters. I'm trying to figure out how to do this within the tidyverse. It looks like map() and mutate() are the answers to this, but I'm having trouble.

    # building the practice dataset
    pracdf <- tibble(ID = letters,
             p = runif(26, 100, 1000),
             med.a = runif(26),
             med.b = runif(26),
             c = runif(26))

    pracdf <- pracdf %>%
      mutate(low.a = med.a * 0.8,
             low.b = med.b * 0.8,
             high.a = med.a * 1.2,
             high.b = med.b * 1.2)
    # this generates a few low/med/high values for variables


    # the function
    pracdf <- pracdf %>% mutate(d = p * med.a * med.b * c)
    # works as expected. Now can I loop it with dynamic variable names?


    f1 <- function(df, var.a) {
      var.a <- enquo(var.a)
      print(var.a)
      d.name <- paste0("d.", quo_name(var.a))
      print(d.name)

      df %>% mutate(!!d.name := p * (!!var.a) * c)
    }

    pracdf2 <- f1(pracdf, med.a)
    # works great! Eventually I want to loop through low, med, high. Start with a loop of 1

    pracdf3 <- map(list(med.a), f1, df = pracdf)
    # loop crashes spectacularly
    pracdf3 <- map(list(med.a), ~f1, df = pracdf)
    # failure
    pracdf3 <- map(med.a, ~f1, df = pracdf)
    # what am I doing with my life

回答1:

I think one of the issues making this task difficult is the current set up might not be very "tidy". E.g. low.a, low.b, med.a etc appear to be examples of what I understand to be 'untidy' columns.

Below is one possible approach (which I am fairly sure can probably be improved) which doesn't use a for loop or custom function at all. The key idea is to take the initial pracdf and expand the existing rows so there is one row for each "level" (i.e., low, med, and high). Doing this lets us calculate d in a single step with no for loops for low, med, and high.

(Edited for readability and to include Jens Leerssen's suggestions)

library(dplyr)
library(tidyr)
set.seed(123)
pracdf <- tibble(ID = letters,
                 p = runif(26, 100, 1000),
                 a = runif(26),
                 b = runif(26),
                 c = runif(26))

levdf <- tibble(level = c("low", "med", "high"),
                level_val = c(0.8, 1.0, 1.2))

tidy_df <- pracdf %>% merge(levdf) %>%
  mutate(d = p * (level_val * a) * (level_val * b) * c) %>%
  select(-level_val) %>% arrange(ID) %>% as_tibble()

tidy_df

#> # A tibble: 78 x 7
#>       ID        p         a         b         c level         d
#>    <chr>    <dbl>     <dbl>     <dbl>     <dbl> <chr>     <dbl>
#>  1     a 358.8198 0.5440660 0.7989248 0.3517979   low 35.116168
#>  2     a 358.8198 0.5440660 0.7989248 0.3517979   med 54.869013
#>  3     a 358.8198 0.5440660 0.7989248 0.3517979  high 79.011379
#>  4     b 809.4746 0.5941420 0.1218993 0.1111354   low  4.169914
#>  5     b 809.4746 0.5941420 0.1218993 0.1111354   med  6.515490
#>  6     b 809.4746 0.5941420 0.1218993 0.1111354  high  9.382306
#>  7     c 468.0792 0.2891597 0.5609480 0.2436195   low 11.837821
#>  8     c 468.0792 0.2891597 0.5609480 0.2436195   med 18.496595
#>  9     c 468.0792 0.2891597 0.5609480 0.2436195  high 26.635096
#> 10     d 894.7157 0.1471136 0.2065314 0.6680556   low 11.622957
#> # ... with 68 more rows

However, the result above might not be the format you want the final data in. But we can take care of this by doing some gathering and spreading of tidy_df using tidyr::gather and tidyr::spread.

tidy_df %>%
  gather(variable, value, a, b, d) %>%
  unite(level_variable, level, variable) %>%
  spread(level_variable, value)

#> # A tibble: 26 x 12
#>       ID        p         c     high_a     high_b     high_d      low_a
#>  * <chr>    <dbl>     <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1     a 358.8198 0.3517979 0.54406602 0.79892485  79.011379 0.54406602
#>  2     b 809.4746 0.1111354 0.59414202 0.12189926   9.382306 0.59414202
#>  3     c 468.0792 0.2436195 0.28915974 0.56094798  26.635096 0.28915974
#>  4     d 894.7157 0.6680556 0.14711365 0.20653139  26.151654 0.14711365
#>  5     e 946.4206 0.4176468 0.96302423 0.12753165  69.905442 0.96302423
#>  6     f 141.0008 0.7881958 0.90229905 0.75330786 108.778072 0.90229905
#>  7     g 575.2949 0.1028646 0.69070528 0.89504536  52.681362 0.69070528
#>  8     h 903.1771 0.4348927 0.79546742 0.37446278 168.480110 0.79546742
#>  9     i 596.2915 0.9849570 0.02461368 0.66511519  13.845603 0.02461368
#> 10     j 510.9533 0.8930511 0.47779597 0.09484066  29.775361 0.47779597
#> # ... with 16 more rows, and 5 more variables: low_b <dbl>, low_d <dbl>,
#> #   med_a <dbl>, med_b <dbl>, med_d <dbl>


回答2:

Consider a vectorized approach (forgive me for non-tidyverse data wrangling) where all new columns can be handled in one call. Use seed(888) before random data to reproduce output:

f1 <- function(df, vars) {
  df[paste0("d.", vars)] <- df$p * df[vars] * df$c
  return(df)
}

newpracdf <- f1(pracdf, c("low.a","high.a","med.a","med.b","low.b","high.b"))

Output

# # A tibble: 26 x 15
#       ID        p      med.a      med.b          c      low.a     low.b    high.a    high.b    d.low.a   d.high.a    d.med.a    d.med.b    d.low.b   d.high.b
#    <chr>    <dbl>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#  1     a 122.9573 0.65746601 0.43123587 0.81314570 0.52597281 0.3449887 0.7889592 0.5174830  52.587917  78.881876  65.734897  43.115909  34.492727  51.739091
#  2     b 412.0127 0.19793909 0.77148952 0.26039116 0.15835127 0.6171916 0.2375269 0.9257874  16.988630  25.482945  21.235787  82.768834  66.215068  99.322601
#  3     c 155.1248 0.30834064 0.99850558 0.57853922 0.24667251 0.7988045 0.3700088 1.1982067  22.137823  33.206735  27.672279  89.611689  71.689351 107.534027
#  4     d 715.3769 0.85517040 0.81715464 0.84196723 0.68413632 0.6537237 1.0262045 0.9805856 412.071636 618.107455 515.089546 492.191742 393.753393 590.630090
#  5     e 790.5284 0.12617255 0.59290522 0.54879020 0.10093804 0.4743242 0.1514071 0.7114863  43.790379  65.685568  54.737973 257.222588 205.778070 308.667105
#  6     f 193.6968 0.15173488 0.93054996 0.08587380 0.12138791 0.7444400 0.1820819 1.1166600   2.019104   3.028655   2.523879  15.478286  12.382629  18.573943
#  7     g 451.6000 0.88123996 0.62858787 0.12546384 0.70499197 0.5028703 1.0574880 0.7543054  39.944473  59.916709  49.930591  35.615457  28.492365  42.738548
#  8     h 342.3741 0.09952918 0.56932309 0.10980862 0.07962334 0.4554585 0.1194350 0.6831877   2.993489   4.490234   3.741861  21.404056  17.123245  25.684867
#  9     i 143.9489 0.42407685 0.94929822 0.02754267 0.33926148 0.7594386 0.5088922 1.1391579   1.345083   2.017624   1.681353   3.763718   3.010975   4.516462
# 10     j 911.8069 0.25822441 0.08934875 0.55244369 0.20657953 0.0714790 0.3098693 0.1072185 104.058645 156.087967 130.073306  45.006930  36.005544  54.008316
# # ... with 16 more rows