I would like some help coloring a ggplot2 histogram generated from summarized data in a data.frame.
The dataset I'm using is R's built-in USArrests dataset.
I'm trying to adapt the solution that was given to this question by arun.
The desired result is to make a histogram of "Crime" and color each bar according to the relative contribution of c("Assault", "Rape", "Murder").
The code:
attach(USArrests)
# Create vector with the sum of arrests per state
Crime <- with(USArrests, Murder + Rape + Assault)
# Bind vector Crime to data frame USArrests and name it USArrests.transform
USArrests.transform <- cbind(USArrests, Crime)
# Check whether the package is installed, and install it if not
if (!require("ggplot2")) {
install.packages("ggplot2")
library(ggplot2)
}
ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get crime histogram plot and name it crime.plot
crime.plot <- ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get data of crime plot: cols = count, xmin and xmax
crime.data <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# add an id column for ddply
crime.data$id <- seq(nrow(crime.data))
# Check whether the package is installed, and install it if not
if (!require("plyr")) {
install.packages("plyr")
library(plyr)
}
# Split data frame, apply function and return results in a data frame: ddply
crime.data.transform <- ddply(crime.data, .(id), function(x) {
tranche <- USArrests.transform[USArrests.transform$Crime >= x$xmin & USArrests.transform$Crime <= x$xmax, ]
if(nrow(tranche) == 0) return(c(x$x, 0, 0))
crime.plot <- c(x=x$x, colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["Crime"] * x$count)
})
# Check whether the package is installed, and install it if not
if (!require("reshape2")) {
install.packages("reshape2")
library(reshape2)
}
crime.data.transform <- melt(crime.data.transform, id.var="id")
ggplot(data = crime.data.transform, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)
The above gives the following error:
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
Subsequently, there are errors in the part after the reshape.
Any suggestions on what I'm doing wrong and how it could be solved in the above example?
Sorry for the long answer; I felt like doing some code optimisation. Most of the code is not yours, but even in arun's code I found some room for optimisation. Let's go through what I changed:
- I removed your attach statement because it is not needed, and if you work with multiple datasets it is bad practice to use attach, mainly because you lose track of your data structures.
- If you create a sequence and the step is 1, just use : and not seq. I explained here why.
- The error in your code: return(c(x$x, 0, 0)) has one zero too few. The other branch returns four values (x plus Murder, Assault and Rape), so the results do not have equal lengths.
- In addition, you do not need x$x inside the ddply function. Thus it should just be return(c(0,0,0)), and the next line should begin with c(colSums(tranche)[c("Murder", "Assault", "Rape")] without the leading x=x$x. Otherwise R will plot all the x values as well.
- Heck! You actually do not need plyr here. The ddply function is just a simple loop over the rows of your crime.data data.frame. That is something you can achieve with an lapply loop.
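To make the length mismatch concrete, here is a minimal sketch in base R. The numbers are made up and only stand in for the two return values of the ddply callback; they are not taken from the USArrests data.

```r
# Hypothetical stand-ins for the two branches of the original callback
empty_bin  <- c(10, 0, 0)                                   # like c(x$x, 0, 0): 3 values
filled_bin <- c(x = 10, Murder = 1, Assault = 5, Rape = 2)  # x plus three shares: 4 values

# The lengths differ, which is what the plyr error complains about
lengths(list(empty_bin, filled_bin))

# With the same names and length in both branches the rows combine cleanly
empty_fixed  <- c(Murder = 0, Assault = 0, Rape = 0)
filled_fixed <- c(Murder = 1, Assault = 5, Rape = 2)
fixed <- rbind(empty_fixed, filled_fixed)
fixed
```

Naming the zeros in the empty branch also keeps the column names consistent, which matters later when the result is reshaped for plotting.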
Here I maybe need to explain a bit: the plyr package tried to overcome the shortcomings of the apply family of functions. Except for lapply, their behaviour is rather unpredictable. sapply in particular might return anything from a vector to a matrix to a list. Only lapply is reliable, because it always gives you a list:
# add the summed arrests per state as a new column
USArrests_sum <- cbind(USArrests, arrests = with(USArrests, Murder + Rape + Assault))
# check whether ggplot2 is installed, and install it if not
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
# get crime histogram plot and name it crime.plot
crime.plot <- ggplot(data = USArrests_sum, aes(x = arrests)) + geom_histogram()
# get data of crime plot: cols = count, x, xmin and xmax
crime_df <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
crime_df$id <- 1:nrow(crime_df) # add an id column (one id per bin)
# loop over the bins and split each bin's count by the share of each crime
tranche_list <- lapply(1:nrow(crime_df), function(j) {
  myrows <- (USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
  tranche <- USArrests_sum[myrows, ]
  # empty bin: return named zeros so every result has the same length and names
  if (nrow(tranche) == 0) return(c('Murder' = 0, 'Assault' = 0, 'Rape' = 0))
  colSums(tranche)[c("Murder", "Assault", "Rape")] / colSums(tranche)["arrests"] * crime_df$count[j]
})
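As a small illustration of the remark about sapply, the following sketch uses toy inputs (not the arrests data) to show how the type of the result changes with the shape of what the function returns, while lapply stays a list.

```r
# sapply() simplifies whenever it can, so the result type depends on the data
res_vector <- sapply(1:3, function(i) i^2)        # one value each: numeric vector
res_matrix <- sapply(1:3, function(i) c(i, i^2))  # two values each: 2 x 3 matrix
res_list   <- sapply(1:3, function(i) seq_len(i)) # differing lengths: list

# lapply() returns a list regardless of what the function returns
res_lapply <- lapply(1:3, function(i) c(i, i^2))

c(is.matrix(res_vector), is.matrix(res_matrix), is.list(res_list), is.list(res_lapply))
```

This is why the loop above uses lapply: the shape of tranche_list does not depend on whether a bin happens to be empty.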
The alternative is to use dplyr to transform your data; maybe somebody else feels like doing that. I prefer base R.
In the next step you use reshape2, whose successor is tidyr. But the data structure is so simple that you can use base R if you like:
stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
variable=names(unlist(tranche_list)),
id=rep(1:nrow(crime_df),each=3))
ggplot(data = stack_df2, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)
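To show what the base R stacking does, here is a minimal sketch on a made-up two-bin list with the same shape as tranche_list. The name toy_list and its values are hypothetical.

```r
# Hypothetical result for two bins, shaped like tranche_list
toy_list <- list(c(Murder = 1, Assault = 5, Rape = 2),
                 c(Murder = 0, Assault = 0, Rape = 0))

# unlist() flattens the list and keeps the element names
flat <- unlist(toy_list)

# one row per (bin, crime) pair; each bin id is repeated once per crime
toy_df <- data.frame(value    = as.numeric(flat),
                     variable = names(flat),
                     id       = rep(seq_along(toy_list), each = 3))
toy_df
```

The approach relies on every list element having exactly three named values in the same order, which is why the empty-bin branch returns named zeros.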
Appendix
I compared multiple functions with the ddply solution:
plyr_fun<-function(){
ddply(crime_df, .(id), function(x) {
tranche <- USArrests_sum[USArrests_sum$arrests >= x$xmin & USArrests_sum$arrests <= x$xmax, ]
if(nrow(tranche) == 0) return(c(0, 0,0))
crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x$count)
})
}
apply_fun2<-function(){
res_mat<-t(apply(crime_df, 1, function(x) {
tranche <- USArrests_sum[USArrests_sum$arrests >= x['xmin'] & USArrests_sum$arrests <= x['xmax'], ]
if(nrow(tranche) == 0) return(c(0, 0,0))
crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x['count'])
}))
colnames(res_mat) <- c("Murder", "Assault", "Rape")
res_mat # return the matrix; ending on the assignment would return only the column names
}
lapply_fun3<-function(){
tranche_list<-lapply(1:nrow(crime_df), function(j) {
myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
tranche <- USArrests_sum[myrows,]
if(nrow(tranche) == 0) return(c(0, 0,0))
crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
})
do.call(rbind,tranche_list)
}
lapply_fun<-function(){
tranche_list<-lapply(1:nrow(crime_df), function(j) {
myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
tranche <- USArrests_sum[myrows,]
if(nrow(tranche) == 0) return(c('Murder'=0,'Assault'=0,'Rape'=0))
crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
})
}
microbenchmark::microbenchmark(apply_fun2(),lapply_fun3(),lapply_fun(),plyr_fun(),times=1000L)
Unit: milliseconds
expr min lq mean median uq max neval
apply_fun2() 5.2307 5.73340 7.169920 6.17165 7.27340 31.5333 1000
lapply_fun3() 5.3633 5.98930 7.487173 6.40780 7.50115 37.1350 1000
lapply_fun() 5.4470 5.99295 7.762575 6.43975 7.73060 82.2069 1000
plyr_fun() 8.8593 9.83850 12.186933 10.54180 12.75880 192.6898 1000
In this benchmark the apply function is even slightly faster than the lapply solutions, but its readability is quite bad. Usually data.table functions are faster than the apply family, whereas dplyr functions run comparatively slowly but are very readable and are suitable for code translations.
Just for fun, another benchmark of tidyr vs. my base R solution:
tidyr_fun<-function(){
crime_tranche<-do.call(rbind,tranche_list)
# gather() comes from tidyr; calling it with its namespace means tidyr need not be attached
stack_df <- tidyr::gather(data.frame(crime_tranche,id=1:nrow(crime_df)), key=variable,value=value,-id)
}
base_fun<-function(){
stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
variable=names(unlist(tranche_list)),
id=rep(1:nrow(crime_df),each=3))
}
microbenchmark::microbenchmark(tidyr_fun(),base_fun())
Unit: microseconds
expr min lq mean median uq max neval
tidyr_fun() 1588.4 1869.45 2516.253 2302.35 2777.9 7671.3 100
base_fun() 286.7 367.40 530.104 454.85 612.8 3675.8 100
# In case you want to verify that the data is the same:
stack_df <- tidyr_fun() # stack_df is only assigned inside the function, so capture the result first
identical(stack_df2$id[order(stack_df2$id,stack_df2$variable)],stack_df$id[order(stack_df$id,stack_df$variable)])
identical(stack_df2$value[order(stack_df2$id,stack_df2$variable)],stack_df$value[order(stack_df$id,stack_df$variable)])
identical(as.character(stack_df2$variable[order(stack_df2$id,stack_df2$variable)]),stack_df$variable[order(stack_df$id,stack_df$variable)])