3 layer Stacked histogram from already summarized

Posted 2019-08-13 00:20

I would like some help coloring a ggplot2 histogram generated from summarized data in a data.frame.

The dataset I'm using is R's built-in USArrests dataset.

I'm trying to adapt the solution that was given to this question by arun.

The desired result is to make a histogram of "Crime" and color each bar according to the relative contribution of c("Assault", "Rape", "Murder").

The code:

attach(USArrests)

#Create vector SUM arrests per state
Crime <- with(USArrests, Murder+ Rape+ Assault)

# Bind the Crime vector to the USArrests data frame and name the result USArrests.transform
USArrests.transform <- cbind (USArrests, Crime)

# Load the package, installing it first if it is missing
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get crime histogram plot and name it crime.plot
crime.plot <- ggplot (data = USArrests.transform, aes(x= Crime)) + geom_histogram()
# get data of crime plot: cols = count, xmin and xmax
crime.data <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")]
# add an id column for ddply
crime.data$id <- seq(nrow(crime.data))

# Load the package, installing it first if it is missing
if (!require("plyr")) {
  install.packages("plyr")
  library(plyr)
}

# Split the data frame, apply a function, and return the results in a data frame: ddply
crime.data.transform <- ddply(crime.data, .(id), function(x) {
  tranche <- USArrests.transform[USArrests.transform$Crime >= x$xmin & USArrests.transform$Crime <= x$xmax, ]
  if(nrow(tranche) == 0) return(c(x$x, 0, 0))
  crime.plot <- c(x=x$x, colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["Crime"] * x$count)
})

# Load the package, installing it first if it is missing
if (!require("reshape2")) {
  install.packages("reshape2")
  library(reshape2)
}

crime.data.transform <- melt(crime.data.transform, id.var="id")
ggplot(data = crime.data.transform, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)

[Error]: The above gives the following error:

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

Subsequently, there are errors in the part after the reshape.

Any suggestions on what I'm doing wrong and how it could be solved in the above example?

1 answer
爷、活的狠高调
#2 · 2019-08-13 00:44

Sorry for the long answer; I felt like doing some code optimisation. Most of the code is not yours, but even in arun's code I found some room for optimisation. Let's go through what I changed:

  1. I removed your attach statement because it was not needed. If you work with multiple datasets, using attach is bad practice, mainly because you lose track of your data structures.
  2. If you create a sequence with a step of 1, just use : rather than seq. I explained here why.
  3. The error in your code: return(c(x$x, 0, 0)) has one zero too few, so empty bins return 3 values while all other bins return 4.
  4. In addition, you do not need x$x inside the ddply function. The early return should just be return(c(0,0,0)), and the next line should drop the x=x$x element, leaving c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["Crime"] * x$count). Otherwise R will plot all the x values as well.
  5. Heck! You actually do not need plyr here. The ddply call is just a simple loop over the rows of your crime.data data.frame, which you can also write with lapply.
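
The length mismatch behind point 3 can be reproduced without plyr. The following is only a sketch with made-up numbers: the one-row `bin` and the crime `totals` are hypothetical and not taken from USArrests. It just compares the lengths of the two possible return values before and after the fix.

```r
# Hypothetical one-row bin, standing in for a single row of crime.data.
bin <- data.frame(x = 100, count = 2)

# What the original callback returns for an empty bin: 3 elements.
empty_result <- c(bin$x, 0, 0)

# What it returns for a non-empty bin: x plus three crime shares, 4 elements.
# The totals are invented purely for illustration.
totals <- c(Murder = 10, Assault = 180, Rape = 20)
filled_result <- c(x = bin$x, totals / sum(totals) * bin$count)

# Unequal lengths are what ddply complains about.
lengths_before <- c(length(empty_result), length(filled_result))

# After the fix, both branches return the same three named elements.
fixed_empty  <- c(Murder = 0, Assault = 0, Rape = 0)
fixed_filled <- totals / sum(totals) * bin$count
lengths_after <- c(length(fixed_empty), length(fixed_filled))

print(lengths_before)
print(lengths_after)
```

Because every bin now yields a vector with the same names and length, the per-bin results can be combined into a rectangular data frame.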

Here I maybe need to explain a bit: the plyr package tried to overcome the shortcomings of the apply family of functions. Except for lapply, their behaviour is rather unpredictable. sapply in particular might return a vector, a matrix, or a list, depending on the results. Only lapply is reliable, because it always gives you a list:

USArrests_sum <- cbind (USArrests, arrests=with(USArrests, Murder+ Rape+ Assault))

# Load the package, installing it first if it is missing
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

# get crime histogram plot and name it crime.plot
crime.plot <- ggplot (data = USArrests_sum, aes(x= arrests)) + geom_histogram()
crime_df <- ggplot_build(crime.plot)$data[[1]][c("count", "x", "xmin", "xmax")] # get data of crime plot: cols = count, xmin and xmax
crime_df$id <- 1:nrow(crime_df) # add an id column

# Loop over the histogram bins with lapply and compute each crime's share of the bin count
tranche_list<-lapply(1:nrow(crime_df), function(j) {
  myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
  tranche <- USArrests_sum[myrows,]
  if(nrow(tranche) == 0) return(c('Murder'=0,'Assault'=0,'Rape'=0))
  crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
})
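
As a side illustration of the remark about the apply family, the sketch below uses toy inputs (not from the question) to show how the return type of sapply depends on the shape of the results, while lapply returns a list in every case.

```r
# sapply() chooses its return type from the shape of the individual results.
as_vector <- sapply(1:3, function(i) i^2)         # one value each -> numeric vector
as_matrix <- sapply(1:3, function(i) c(i, i^2))   # two values each -> 2 x 3 matrix
as_list   <- sapply(1:3, function(i) seq_len(i))  # ragged lengths -> list

# lapply() returns a list regardless of what the function produces.
always_list <- lapply(1:3, function(i) c(i, i^2))

print(class(as_vector))
print(dim(as_matrix))
print(class(as_list))
print(class(always_list))
```

This is why the lapply version needs an explicit combining step (such as unlist or do.call(rbind, ...)) afterwards: the shape of the result is then chosen by you, not guessed by the function.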

The alternative is to use dplyr to transform your data; maybe somebody else feels like writing that. I prefer base R.

In the next step you use reshape2, whose successor is tidyr. But the data structure is so simple that you can use base R if you like:

stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
                      variable=names(unlist(tranche_list)),
                      id=rep(1:nrow(crime_df),each=3))

ggplot(data = stack_df2, aes(x=id, y=value)) + geom_bar(aes(fill=variable), stat="identity", group=1)
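
A quick sanity check on the stacking logic: within each bin, the three segments should add up to the bin count. The sketch below is not the code above but a base-R-only variant, under the assumption that bin edges from hist() are an acceptable stand-in for the ones ggplot_build() returns (the bins will differ from the ggplot2 ones). The names `arrests_sum`, `bin_id`, `shares`, and `stack_check` are introduced here for illustration.

```r
# Total arrests per state, as in the answer.
arrests_sum <- cbind(USArrests,
                     arrests = with(USArrests, Murder + Rape + Assault))

# Assumption: use hist() for the binning instead of ggplot_build().
h <- hist(arrests_sum$arrests, breaks = 10, plot = FALSE)

# Assign every state to exactly one bin, using the same rules as hist().
bin_id <- cut(arrests_sum$arrests, breaks = h$breaks,
              include.lowest = TRUE, labels = FALSE)

# Per bin: each crime's share of the arrests, scaled to the bin count.
shares <- lapply(seq_along(h$counts), function(j) {
  tranche <- arrests_sum[bin_id == j, ]
  if (nrow(tranche) == 0) return(c(Murder = 0, Assault = 0, Rape = 0))
  colSums(tranche)[c("Murder", "Assault", "Rape")] /
    sum(tranche$arrests) * h$counts[j]
})

# Long format, built the same way as stack_df2.
stack_check <- data.frame(value = as.numeric(unlist(shares)),
                          variable = names(unlist(shares)),
                          id = rep(seq_along(shares), each = 3))

# The stacked height of each bar should match the histogram count.
bar_heights <- tapply(stack_check$value, stack_check$id, sum)
print(all.equal(as.numeric(bar_heights), as.numeric(h$counts)))
```

Using cut() puts each state in exactly one bin, whereas the `>= xmin & <= xmax` test would place a value lying exactly on a bin edge in both neighbouring bins.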

Appendix

I compared multiple functions with the ddply-solution:

plyr_fun<-function(){
  ddply(crime_df, .(id), function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x$xmin & USArrests_sum$arrests <= x$xmax, ]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x$count)
  })
}

apply_fun2<-function(){
  res_mat<-t(apply(crime_df, 1, function(x) {
    tranche <- USArrests_sum[USArrests_sum$arrests >= x['xmin'] & USArrests_sum$arrests <= x['xmax'], ]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * x['count'])
  }))
  colnames(res_mat) <- c("Murder", "Assault", "Rape")
  res_mat # return the matrix; ending on the assignment would return only the column names
}

lapply_fun3<-function(){
  tranche_list<-lapply(1:nrow(crime_df), function(j) {
    myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows,]
    if(nrow(tranche) == 0) return(c(0, 0,0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
  })
  do.call(rbind,tranche_list)
}

lapply_fun<-function(){
  tranche_list<-lapply(1:nrow(crime_df), function(j) {
    myrows<-(USArrests_sum$arrests >= crime_df$xmin[j] & USArrests_sum$arrests <= crime_df$xmax[j])
    tranche <- USArrests_sum[myrows,]
    if(nrow(tranche) == 0) return(c('Murder'=0,'Assault'=0,'Rape'=0))
    crime.plot <- c(colSums(tranche)[c("Murder", "Assault", "Rape")]/colSums(tranche)["arrests"] * crime_df$count[j])
  })
}

microbenchmark::microbenchmark(apply_fun2(),lapply_fun3(),lapply_fun(),plyr_fun(),times=1000L)
Unit: milliseconds
          expr    min      lq      mean   median       uq      max neval
  apply_fun2() 5.2307 5.73340  7.169920  6.17165  7.27340  31.5333  1000
 lapply_fun3() 5.3633 5.98930  7.487173  6.40780  7.50115  37.1350  1000
  lapply_fun() 5.4470 5.99295  7.762575  6.43975  7.73060  82.2069  1000
    plyr_fun() 8.8593 9.83850 12.186933 10.54180 12.75880 192.6898  1000

Actually, the apply version is even marginally faster than the lapply solutions (median about 6.2 ms versus about 6.4 ms), but its readability is quite bad. Usually data.table functions are faster than the apply family, whereas dplyr functions run comparatively slowly but are readable and suitable for code translation.

Just for fun - another benchmark of tidyr vs my base R solution:

tidyr_fun<-function(){
  crime_tranche<-do.call(rbind,tranche_list)
  stack_df <- tidyr::gather(data.frame(crime_tranche,id=1:nrow(crime_df)), key=variable,value=value,-id) # needs the tidyr package installed
}

base_fun<-function(){
  stack_df2<-data.frame(value=as.numeric(unlist(tranche_list)),
                        variable=names(unlist(tranche_list)),
                        id=rep(1:nrow(crime_df),each=3))
}

microbenchmark::microbenchmark(tidyr_fun(),base_fun())
Unit: microseconds
        expr    min      lq     mean  median     uq    max neval
 tidyr_fun() 1588.4 1869.45 2516.253 2302.35 2777.9 7671.3   100
  base_fun()  286.7  367.40  530.104  454.85  612.8 3675.8   100

# In case you want to verify that the data is the same.
stack_df <- tidyr_fun() # stack_df only exists inside tidyr_fun(), so capture its return value first
identical(stack_df2$id[order(stack_df2$id,stack_df2$variable)],stack_df$id[order(stack_df$id,stack_df$variable)])
identical(stack_df2$value[order(stack_df2$id,stack_df2$variable)],stack_df$value[order(stack_df$id,stack_df$variable)])
identical(as.character(stack_df2$variable[order(stack_df2$id,stack_df2$variable)]),stack_df$variable[order(stack_df$id,stack_df$variable)])