summarising data frame rows based on name prefix

2019-09-19 15:45发布

问题:

I'd like to create a summary data frame that gathers all rows based on a text prefix, with an average, max and min for each variable. So in the example below, I'd like to summarise the average, min and max values for the "Jim" shops, "Jen, shops etc, as well as the same values for all of the furniture in each group of shops.

shop    tables  chairs  beds
jim-1   2   63  31
jim-2a  10  4   16
jim-2b  32  34  43
jen-1   32  90  32
jen-2   73  91  6
jen-3   35  85  65
sam-a   72  57  72
sam-b   18  48  11
sam-c   34  49  79
paul-1  43  49  23
paul-2  76  20  23
paul-2a 34  20  8

Note that some shops are 1,2,3 or a,b,c etc and that there can be a variable number of letters in a name (jim vs paul). I'd like my output to resemble:

shop_group  tables_av   tables_min  tables_max  chairs_av   chairs_min  chairs_max  beds_av beds_min    beds_max    furniture_av    furniture_min   furniture_max
jim 14.67   2.00    32.00   33.67   4.00    63.00   30.00   16.00   43.00   78.33   30.00   109.00
jen 46.67   32.00   73.00   88.67   85.00   91.00   34.33   6.00    65.00   169.67  154.00  185.00
sam 41.33   18.00   72.00   51.33   48.00   57.00   54.00   11.00   79.00   146.67  77.00   201.00
paul    51.00   34.00   76.00   29.67   20.00   49.00   18.00   8.00    23.00   98.67   62.00   119.00

Thanks in advance...

回答1:

Just construct the shop group, and use aggregate together with summary to get the output you were seeking.

shop_group = sub("-.*", "", df$shop)
aggregate(df[,2:4], list(shop_group), 
     FUN = function(x) summary(x)[c(4,1,6)])
  Group.1 tables.Mean tables.Min. tables.Max. chairs.Mean chairs.Min. chairs.Max. beds.Mean beds.Min. beds.Max.
1     jen       46.67       32.00       73.00       88.67       85.00       91.00     34.33      6.00     65.00
2     jim       14.67        2.00       32.00       33.67        4.00       63.00     30.00     16.00     43.00
3    paul       51.00       34.00       76.00       29.67       20.00       49.00     18.00      8.00     23.00
4     sam       41.33       18.00       72.00       51.33       48.00       57.00     54.00     11.00     79.00


回答2:

I still don't know why the data frame wasn't outputting properly, but just in case someone else is having the same problem, here is the solution that a colleague came up with:

library(tibble)
library(dplyr)
library(magrittr)
df %>% 
mutate(shop = gsub("-.*", "", shop)) %>%
group_by(shop) %>%
summarise_each(funs(mean, min, max)) -> summary_df


标签: r dplyr