I would like some help coloring a ggplot2 histogram generated from summarized data in a data.frame.
The dataset I'm using is R's built-in USArrests dataset.
I'm trying to adapt the solution that was given to this question by arun.
The desired result is to make a histogram of "Crime" and color each bar according to the relative contribution of c("Assault", "Rape", "Murder").
The code:
attach(USArrests)
#Create vector SUM arrests per state
Crime <- with(USArrests, Murder+ Rape+ Assault)
#bind vector Crime to data frame USArrests and name it USArrests.transform
USArrests.transform <- cbind (USArrests, Crime)
#See if package is installed, and install it if not
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get crime histogram plot and name it crime.plot
crime.plot <- ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get data of crime plot: cols = count, xmin and xmax
crime.data <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# add an id column for ddply
crime.data$id <- seq(nrow(crime.data))
#See if package is installed, and install it if not
if (!require("plyr")) {
  install.packages("plyr")
  library(plyr)
}
#Split data frame, apply function and return results in a data frame: ddply
crime.data.transform <- ddply(crime.data, .(id), function(x) {
  tranche <- USArrests.transform[USArrests.transform$Crime >= x$xmin & USArrests.transform$Crime <= x$xmax, ]
  if(nrow(tranche) == 0) return(c(x$x, 0, 0))
  crime.plot <- c(x=x$x, colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["Crime"] * x$count)
})
#See if package is installed, and install it if not
if (!require("reshape2")) {
  install.packages("reshape2")
  library(reshape2)
}
crime.data.transform <- melt(crime.data.transform, id.var="id")
ggplot(data = crime.data.transform, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)
[Error]: The above gives the following error:
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
Subsequently there are errors in the part after the reshape.
Any suggestions on what I'm doing wrong and how it could be solved in the above example?
Sorry for the long answer; I felt like doing some code optimisation. Most of the code is not yours, but even in arun's code I found some room for optimisation. Let's go through what I changed:

- I removed the `attach` statement, because it was not needed, and if you work with multiple datasets it is bad practice to use `attach`, mainly because you lose track of your data structures.
- I use `seq_len` and not `seq`. I explained here why.
- In `return(c(x$x, 0, 0))` there is one zero too few.
- I removed `x$x` inside the `ddply` function. Thus it should just be `return(c(0,0,0))`, and the next line needs to begin with `c(colSums(tranche)[c("Murder", "Assault", "Rape")]`, without the `x=x$x` element. Otherwise R will plot all the `x` values as well.
- You do not need `plyr` here. This `ddply` function is just a simple loop over the rows of your `crime.data` data.frame. That is something you can achieve with an `lapply` loop.

Here I maybe need to explain a bit: the `plyr` package tried to overcome the shortcomings of the `apply` family of functions. Except for `lapply`, their behaviour is rather unpredictable. Especially `sapply` might return anything from a `vector` over a `matrix` to a `list` object. Only `lapply` is reliable: it always gives you a `list` result.

The alternative is to use `dplyr` to transform your data; maybe somebody else feels like doing that. I prefer base R.

In the next step you use `reshape2`; its successor is `tidyr`. But actually the data structure is so simple that you can use base R if you like.

Appendix

I compared multiple functions with the `ddply` solution. Actually the `apply` function is even faster than the `lapply` solution, but readability is quite bad. Usually `data.table` functions are faster than the `apply` family, whereas `dplyr` functions run comparatively slowly but have good readability and are suitable for code translations.

Just for fun, I also ran another benchmark of `tidyr` vs my base R solution.
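As a rough recap of the changes listed above, here is a minimal base R sketch. It is not the code from the original answer. The bins are computed by hand with an assumed 30 equal-width bins instead of being read from `ggplot_build()`, so the bin edges may differ slightly from what `geom_histogram()` draws. The names `arrests`, `parts`, `n_bins`, `shares` and `crime_long` are made up for illustration. The plotting step only runs if ggplot2 happens to be installed.

```r
# Work on a copy and add the per-state total; no attach() needed.
arrests <- USArrests
arrests$Crime <- with(arrests, Murder + Rape + Assault)

parts  <- c("Murder", "Assault", "Rape")
n_bins <- 30  # assumption: 30 equal-width bins, chosen for illustration

# Assign every state to exactly one bin (left-closed, last bin closed),
# which avoids counting a state twice when it sits on a bin edge.
width <- diff(range(arrests$Crime)) / n_bins
arrests$bin <- pmin(floor((arrests$Crime - min(arrests$Crime)) / width) + 1,
                    n_bins)

# lapply() always returns a list; every element has length 3 and the same
# names, so the pieces can be bound together without length mismatches.
per_bin <- lapply(seq_len(n_bins), function(i) {
  tranche <- arrests[arrests$bin == i, ]
  if (nrow(tranche) == 0) {
    return(setNames(c(0, 0, 0), parts))
  }
  # share of each crime type in the bin, scaled to the bin's count
  colSums(tranche[parts]) / sum(tranche$Crime) * nrow(tranche)
})
shares <- do.call(rbind, per_bin)  # n_bins x 3 matrix

# Wide-to-long reshape in base R; as.vector() unrolls the matrix column by
# column, matching the order of rep(parts, each = n_bins).
crime_long <- data.frame(
  id       = rep(seq_len(n_bins), times = length(parts)),
  variable = rep(parts, each = n_bins),
  value    = as.vector(shares)
)

# Optional: stacked bars, only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(crime_long,
                       ggplot2::aes(x = id, y = value, fill = variable)) +
    ggplot2::geom_bar(stat = "identity")
  print(p)
}
```

Each row of `shares` sums to the number of states in that bin, so the stacked bars should have the same heights as a plain histogram drawn with the same binning.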