I have a dataset with three columns: user_type, imp_date, and lag (the lag response time, in days):
user_type imp_date lag
Consumer 20130613 1
Consumer 20130612 2
Consumer 20130611 3
Consumer 20130612 3
Producer 20130610 10
Producer 20130614 5
Producer 20130613 7
I would like to calculate the percentage breakdown of lag for EACH user_type. Here is an example of the output I would like:
user_type lag percentage
--------- --- ----------
Consumer 1 0.25
Consumer 2 0.25
Consumer 3 0.5
Producer 5 0.333
Producer 7 0.333
Producer 10 0.333
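The percentage breakdown of lag response time is calculated with respect to the total number of rows in each user_type group. For example, Consumer has 4 rows, 2 of which have a lag of 3, so that percentage is 0.5.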
Specifically, I would like to use ddply from plyr, and I have something along these lines:
a = ddply(data, .(user_type), summarize, table(lag)/length(lag))
but the result does not include the lag column.
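For reference, here is a minimal sketch of one way the desired table might be built with ddply. It assumes the data frame is named data and holds the sample rows above; the helper name lag_share is made up for illustration. It passes ddply a function that returns a data frame with an explicit lag column, instead of relying on summarize:

```r
library(plyr)

# Sample data from the question
data <- data.frame(
  user_type = c("Consumer", "Consumer", "Consumer", "Consumer",
                "Producer", "Producer", "Producer"),
  imp_date  = c(20130613, 20130612, 20130611, 20130612,
                20130610, 20130614, 20130613),
  lag       = c(1, 2, 3, 3, 10, 5, 7)
)

# Hypothetical helper: share of each lag value within one user_type group
lag_share <- function(d) {
  p <- prop.table(table(d$lag))          # counts divided by the group total
  data.frame(lag = as.numeric(names(p)), # lag values recovered from the table names
             percentage = as.vector(p))
}

# Apply the helper to each user_type group and stack the results
result <- ddply(data, .(user_type), lag_share)
print(result)
```

The result should have one row per (user_type, lag) pair, with the percentages in each user_type group summing to 1.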
P.S. My original motivation was to plot the lag distribution for each user type, and I have:
p <- ggplot(data, aes(x = lag, fill = factor(user_type)))
p + geom_bar(aes(y = (..count..)/sum(..count..)))
but the percentage breakdown of lag for each user_type seems incorrect (i.e. the percentage appears to be calculated with respect to each lag group, not each user_type group). As a result, I decided to transform my dataset before plotting. If there is an easier way, please share.
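To illustrate the transform-then-plot route, here is a rough sketch, assuming the same sample data and a data frame named data. It computes the per-user_type proportions with base R's prop.table and plots the precomputed values. The column name percentage is my own choice, and this is only one possible approach, not necessarily the simplest:

```r
library(ggplot2)

# Sample data from the question
data <- data.frame(
  user_type = c("Consumer", "Consumer", "Consumer", "Consumer",
                "Producer", "Producer", "Producer"),
  lag       = c(1, 2, 3, 3, 10, 5, 7)
)

# Cross-tabulate user_type by lag, then normalise each row (margin = 1)
# so the proportions within every user_type sum to 1
props <- as.data.frame(
  prop.table(table(user_type = data$user_type, lag = data$lag), margin = 1)
)
names(props)[names(props) == "Freq"] <- "percentage"

# Plot the precomputed proportions; stat = "identity" uses y as given
p <- ggplot(props, aes(x = lag, y = percentage, fill = user_type)) +
  geom_bar(stat = "identity", position = "dodge")
print(p)
```

Note that the cross-tabulation keeps lag values that never occur for a user_type, with a proportion of 0, so those bars are simply empty.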
Thanks!