Comparison between dplyr::do / purrr::map, what ad

Posted 2019-01-21 08:22

When using broom, I used to combine dplyr::group_by and dplyr::do to perform actions on grouped data, thanks to @drob. For example, fitting a linear model to cars grouped by transmission type (am):

library("dplyr")
library("tidyr")
library("broom")

# using do()
mtcars %>%
  group_by(am) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

# Source: local data frame [4 x 6]
# Groups: am [2]

#     am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 2     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04
# 3     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 4     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05

After reading the recent post from @hadley about tidyr v0.4.1, I discovered that the same thing can be achieved using nest() and purrr::map().

Same example as before:

by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .)))

by_am %>%
  unnest(model %>% purrr::map(tidy))

# Source: local data frame [4 x 6]

#      am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 2     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05
# 3     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 4     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04

The row order changed, but the results are the same.
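One way to check that claim programmatically is to sort both results the same way and compare the estimates. This is only a sketch: the names via_do and via_map are illustrative, the tidy() call is moved inside map() so the unnest() step does not rely on the expression-style unnest() used above, and ungroup() is added defensively on the assumption that grouping after nest() may differ between tidyr versions.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# do() approach: one tidy() data frame per group, row-bound by do()
via_do <- mtcars %>%
  group_by(am) %>%
  do(tidy(lm(mpg ~ wt, data = .))) %>%
  ungroup() %>%
  arrange(am, term)

# nest() + map() approach: fit and tidy inside a list-column, then unnest
via_map <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  ungroup() %>%
  mutate(tidied = map(data, ~ tidy(lm(mpg ~ wt, data = .x)))) %>%
  select(am, tidied) %>%
  unnest(tidied) %>%
  arrange(am, term)

# Compare the coefficient estimates once both are in the same row order
all.equal(via_do$estimate, via_map$estimate)
```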

Given that both largely address the same use case, I am wondering whether both approaches will be supported going forward. Will one method become the canonical tidyverse way? If one is not considered canonical, what use case(s) require that both approaches continue to be supported?

From my short experience:

  • do
    • progress bar, nice when many models are computed.
    • @Axeman comment: can be parallelized using multidplyr
    • smaller object, but the models need to be re-run if we want, for example, broom::glance output.
  • map
    • data, subsets and models are kept within one tbl_df
    • easy to extract another component of models, even if unnest() takes a bit of time.
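The point about extracting another component from stored models could look roughly like the following. It is a sketch under the same assumptions as the question's example (mtcars, grouped by am); the column name glanced is made up for illustration, and ungroup() is added defensively in case nest() keeps the grouping in the installed tidyr version.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# Fit once and keep the data subsets and models in one data frame
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  ungroup() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# Reuse the stored models to get model-level summaries, without refitting
model_summaries <- by_am %>%
  mutate(glanced = map(model, glance)) %>%
  select(am, glanced) %>%
  unnest(glanced)

model_summaries
```

With do(), getting the same summaries would mean running the pipeline again with glance() in place of tidy(), which refits every model.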

If you have any insights or remarks, I would be happy to hear them.
