ddply in R: for each group, find the percentage of

Posted 2019-05-06 17:13

Question:

I have a dataset with a user_type column and a lag column giving the response time in days (plus an imp_date column):

          user_type imp_date lag 
           Consumer 20130613   1  
           Consumer 20130612   2  
           Consumer 20130611   3  
           Consumer 20130612   3  
           Producer 20130610  10  
           Producer 20130614   5  
           Producer 20130613   7  

I would like to calculate the percentage breakdown of lag for each user_type. Here is an example of the output I would like:

user_type        lag    percentage
---------        ---    ----------
Consumer         1      0.25
Consumer         2      0.25
Consumer         3      0.5
Producer         5      0.333
Producer         7      0.333
Producer         10     0.333

The percentage breakdown of lag time response is calculated with respect to the total of each user_type group.

Specifically, I would like to use ddply from plyr, and I have something along the lines of:

a = ddply(data, .(user_type), summarize, table(lag)/length(lag))

but it's not giving me the lag time response column.

P.S. My original motivation was to plot the lag distribution for each user type, and I have:

p <- ggplot(data, aes(x = lag, fill = factor(user_type))) 
p + geom_bar(aes(y = (..count..)/sum(..count..)))

but the percentage breakdown of lag for each user_type looks incorrect (i.e. the percentage was calculated with respect to each lag group, not each user_type group). As a result, I decided to transform my dataset before plotting. If there is an easier way, please share.

Thanks!

Answer 1:

This could be done using ddply with:

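library(plyr)
# For each user_type, tabulate lag and divide by the group size.
# The result has columns user_type, Var1 (the lag value) and Freq (the proportion).
a = ddply(data, .(user_type), function(d) {
    data.frame(table(d$lag)/length(d$lag))
})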

Though I would probably use the data.table package, like so:

library(data.table)
d = data.table(data)
# sort() keeps the lag values aligned with table(), which orders its counts by value
a = d[, list(lag=sort(unique(lag)), percentage=as.numeric(table(lag)/length(lag))), by="user_type"]
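
For comparison, the same per-group breakdown can be sketched in base R with no extra packages. This is only an illustration, not part of the original answer: the data frame below is rebuilt by hand from the sample in the question (imp_date is left out because it is not used), and the names props and result are made up for the example.

```r
# Sample data reconstructed from the question (imp_date omitted, as it is not used).
data <- data.frame(
  user_type = c("Consumer", "Consumer", "Consumer", "Consumer",
                "Producer", "Producer", "Producer"),
  lag = c(1, 2, 3, 3, 10, 5, 7)
)

# Cross-tabulate user_type by lag, then divide each row by its row total,
# so the proportions are relative to each user_type group.
props <- prop.table(table(user_type = data$user_type, lag = data$lag), margin = 1)

# Reshape to long format: one row per (user_type, lag) combination.
result <- as.data.frame(props, responseName = "percentage", stringsAsFactors = FALSE)

# Drop combinations that never occur (e.g. Consumer with lag 10).
result <- result[result$percentage > 0, ]

# The lag values come back as text, so convert them before sorting.
result$lag <- as.numeric(result$lag)
result <- result[order(result$user_type, result$lag), ]
rownames(result) <- NULL
print(result)
```

With margin = 1 the division is done row by row, which is what makes the proportions sum to 1 within each user_type rather than over the whole dataset.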

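As for the plot mentioned in the P.S., one possible approach is to compute the per-group proportions first and hand them to ggplot2 as ready-made bar heights. The sketch below assumes the sample data from the question and a ggplot2 version that provides geom_col. The requireNamespace guard and the names props and p are illustrative choices, not something from the original post.

```r
# Sample data reconstructed from the question (imp_date omitted, as it is not used).
data <- data.frame(
  user_type = c("Consumer", "Consumer", "Consumer", "Consumer",
                "Producer", "Producer", "Producer"),
  lag = c(1, 2, 3, 3, 10, 5, 7)
)

# Proportion of each lag value within its own user_type group.
# lag becomes a factor here, which gives a discrete x axis in the plot.
props <- as.data.frame(
  prop.table(table(user_type = data$user_type, lag = data$lag), margin = 1),
  responseName = "percentage"
)

# Build the plot only if ggplot2 is installed (guard added for this sketch).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_col uses the precomputed proportions as bar heights;
  # position = "dodge" places the user types side by side for each lag.
  p <- ggplot(props, aes(x = lag, y = percentage, fill = user_type)) +
    geom_col(position = "dodge")
  # Typing p at the console, or calling print(p), draws the chart.
}
```

Because the proportions are computed before plotting, the bars for each user_type sum to 1 regardless of how ggplot2 groups the data internally.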

Tags: r ggplot2 plyr