dplyr: Find mean for each bin by groups

Published 2020-03-08 06:51

Question:

I am trying to understand dplyr. I am splitting the values in my data frame by group, bin, and sign, and I am trying to get a mean value for each group/bin/sign combination. I would like to output a data frame with the mean and the count for each group/bin/sign combination, plus the total count for each group. I think I have it, but sometimes I get different values in base R compared to the output of dplyr. Am I doing this correctly? It is also very contorted... is there a more direct way?

library(ggplot2)
df <- data.frame(
  id = sample(LETTERS[1:3], 1000, replace = TRUE),  # one id per row
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)] <- 0

df$bin = cut_interval(abs(df$tobin), length=1)
df$sign = ifelse(df$tobin == 0, "NULL", ifelse(df$tobin > 0, "+", "-"))


# Find mean of value by group, bin, and sign using dplyr
library(dplyr)
res <- df %>%
  group_by(id, bin, sign) %>%
  summarise(Num = length(bin), value = mean(value, na.rm = TRUE))

res %>%
  group_by(id) %>%
  summarise(total = sum(Num))
res = data.frame(res)
total = data.frame(total)
res$total = total[match(res$id, total$id), "total"]

res[res$id=="A" & res$bin=="[0,1]" & res$sign=="NULL",]

# Check in base R if mean by group, bin, and sign is correct # Sometimes not?
groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL",]
mean(groupA$value, na.rm=TRUE)

I am going crazy because it doesn't work on my data, and this command just repeats the mean of the whole dataset:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))

Here the mean column equals mean(df$value, na.rm=TRUE) in every row, completely ignoring the grouping... All the groups are factors, and the value is numeric...

This however works:

with(df, aggregate(value, by = list(id, bin, sign), FUN = mean))

Please help me.

Answer 1:

You seem to be flailing a bit. You've got correct code, then you've got extra code.

Start from a fresh R session, define your data, then run:

library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
        summarise(Num = n(), value = mean(value,na.rm=TRUE))

The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:

head(res)
#   id   bin sign Num       value
# 1  A [0,1]    - 122 -0.08330338
# 2  A [0,1]    + 111  0.11394381
# 3  A [0,1] NULL   2  0.75232462
# 4  A (1,2]    -  54 -0.09236725
# 5  A (1,2]    +  45  0.20581095
# 6  A (2,3]    -  12 -0.08998771
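
A side note on the n() substitution: inside summarise(), n() returns the number of rows in the current group, which is the same number length() gives for any column of that group. The following is only a minimal sketch on a made-up three-row data frame (toy and its columns are hypothetical, not part of the question's data):

```r
library(dplyr)

# Hypothetical toy data: two rows for "A", one row for "B"
toy <- data.frame(id = c("A", "A", "B"), value = c(1, 2, 3))

counts <- toy %>%
  group_by(id) %>%
  summarise(by_n = n(),                 # rows in the current group
            by_length = length(value))  # same count, via a column's length

counts
```

Both columns hold the same counts; n() just avoids having to name a column.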

Jumping ahead to your last couple lines in the code block:

groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL", ]
mean(groupA$value, na.rm=TRUE)
# [1] 0.7523246

This matches the third row of the results above. So you did it; it works fine!
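
If you would rather check every group at once instead of one combination by hand, one possible approach is to compute the same means with base R's aggregate() and compare the two results after joining them on the grouping columns. This is a sketch on made-up data (toy, its columns, and the seed are assumptions for illustration), not your actual data frame:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
toy <- data.frame(
  id    = sample(LETTERS[1:3], 200, replace = TRUE),
  sign  = sample(c("+", "-"), 200, replace = TRUE),
  value = rnorm(200)
)

# Grouped means with dplyr
by_dplyr <- toy %>%
  group_by(id, sign) %>%
  summarise(value = mean(value, na.rm = TRUE))

# The same grouped means with base R
by_base <- aggregate(value ~ id + sign, data = toy, FUN = mean)

# Join on the grouping columns, then compare the two columns of means
both <- merge(as.data.frame(by_dplyr), by_base,
              by = c("id", "sign"), suffixes = c("_dplyr", "_base"))
all_match <- isTRUE(all.equal(both$value_dplyr, both$value_base))
all_match
```

If all_match is FALSE on real data, the rows of both where the two columns differ show which groups disagree.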

The rest of your code is confused:

res %>%
  group_by(id) %>%
  summarise(total = sum(Num))

I'm not sure what you're trying to accomplish with this, but you don't assign the result to anything, so it is computed and printed but not saved. The lines that follow then refer to total, which was never created.
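
If the goal was to attach each id's total count to every row of res, one more direct option is mutate() on the grouped data, which avoids the separate table and the match() lookup. A minimal sketch, using a small made-up stand-in for res (res_toy and its numbers are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for `res`: one row per id/bin combination with its count
res_toy <- data.frame(
  id  = c("A", "A", "B", "B", "B"),
  bin = c("[0,1]", "(1,2]", "[0,1]", "(1,2]", "(2,3]"),
  Num = c(10, 5, 7, 3, 1)
)

# On a grouped data frame, mutate() evaluates sum(Num) within each id
# and repeats that total on every row belonging to the id
res_total <- res_toy %>%
  group_by(id) %>%
  mutate(total = sum(Num)) %>%
  ungroup()

res_total$total
# [1] 15 15 11 11 11
```

Unlike summarise(), mutate() keeps all the original rows, so the per-combination counts and means stay alongside the per-id total.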

As for your ddply attempt:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE))

You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:

You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr)

Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!
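
If you do end up needing both packages, one way to make the code independent of load order is to call the dplyr verbs with an explicit dplyr:: prefix, since pkg::fun always resolves to that package's function regardless of what was attached last. The sketch below uses a made-up data frame (toy and the seed are assumptions for illustration) and only requires dplyr to be installed:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
toy <- data.frame(
  id    = rep(c("A", "B", "C"), each = 4),
  value = rnorm(12)
)

# Explicit namespaces: these are dplyr's group_by() and summarise(),
# whatever other packages happen to be attached
by_id <- dplyr::summarise(
  dplyr::group_by(toy, id),
  value = mean(value, na.rm = TRUE)
)

nrow(by_id)  # one row per id
```

This is more verbose than relying on library() order, so it is probably most useful in scripts where plyr really cannot be dropped.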



Tags: r dplyr