Comparison between dplyr::do / purrr::map, what ad

When using broom, I used to combine dplyr::group_by and dplyr::do to perform actions on grouped data, thanks to @drob. For example, fitting a linear model to cars depending on their gear system:

library("dplyr")
library("tidyr")
library("broom")

# using do()
mtcars %>%
  group_by(am) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

# Source: local data frame [4 x 6]
# Groups: am [2]

#     am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 2     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04
# 3     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 4     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05

After reading the recent post from @hadley about tidyr v0.4.1, I discovered that the same thing can be achieved using nest() and purrr::map().

Same example as before:

by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .)))

by_am %>%
  unnest(model %>% purrr::map(tidy))

# Source: local data frame [4 x 6]

#      am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 2     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05
# 3     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 4     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04

The ordering changed, but results are the same.
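
For readers on newer package versions, here is a minimal sketch of the same nest() / map() workflow, written so the tidied results are stored in a list-column with mutate() before unnesting. This is an assumption about what current tidyr prefers rather than anything from the post above; the column name coefs and the object name tidied are made up for illustration. The final arrange(am) restores the ordering that do() produced.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# Fit one model per value of am and keep it in a list-column
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# 'coefs' is a hypothetical column name holding one tidy() data frame per model
tidied <- by_am %>%
  ungroup() %>%
  mutate(coefs = map(model, tidy)) %>%
  select(am, coefs) %>%
  unnest(coefs) %>%
  arrange(am)   # same row order as the do() version

tidied
```

Dropping data and model with select() before unnest() keeps the result to the same six columns as the do() output.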

Given that both largely address the same use case, I am wondering whether both approaches are going to be supported going forward. Will one method become the canonical tidyverse way? If neither is considered canonical, what use case(s) require that both approaches continue to be supported?

From my short experience:

do:
- shows a progress bar, which is nice when many models are computed.
- per @Axeman's comment: can be parallelized using multidplyr.
- gives a smaller object, but the models must be re-run if we later want, for example, broom::glance.
map:
- data, subsets and models are kept within one tbl_df.
- easy to extract another component of the models, even if unnest() takes a bit of time.
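
To illustrate the difference noted in the two lists, here is a rough sketch of getting broom::glance output both ways. With the nested tbl_df the stored models are reused; with do() the models were never kept, so they are fitted again. The names stats, fit_stats and fit_stats_do are hypothetical, and the mutate-then-unnest form assumes a reasonably recent tidyr.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# map approach: models are stored once in the 'model' list-column
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# Reuse the stored models to extract a different summary, no refitting
fit_stats <- by_am %>%
  ungroup() %>%
  mutate(stats = map(model, glance)) %>%
  select(am, stats) %>%
  unnest(stats)

# do approach: only the tidy() output was kept earlier,
# so lm() has to be run again to obtain glance()
fit_stats_do <- mtcars %>%
  group_by(am) %>%
  do(glance(lm(mpg ~ wt, data = .)))
```

Both results have one row per value of am, with model-level statistics such as r.squared.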

If you have any insights or remarks, I would be happy to hear them.

Tags: r dplyr tidyr broom
