Plot many categories

Posted 2019-09-09 20:26

Question:

I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:

DF <- read.table(text = " Comp         Category

Comp1             1
Comp2             1   
Comp3             4,2
Comp4             1,3
Comp1             1,2
Comp3             3 ", header = TRUE)

barplot(table(DF$Comp))

So this worked perfectly for me.

Next, since a composition can belong to one or more categories, the categories are separated by commas. I want a bar plot with the compositions on the x-axis and the number of occurrences on the y-axis, and within each bar the percentage of each category.

My idea was to duplicate every line that contains a comma, i.e. to repeat it N+1 times, where N is the number of commas.

DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)

This gives me, for example, for Comp1:

          1     2     3     4
Comp1     2     1     0     0

But if I apply this method, the total count over the categories (3) no longer matches the total number of occurrences of the composition (Comp1 = 2).

How should I proceed in this case? Is the solution to divide by the number of commas + 1? If so, how do I do this in my code, and is there a simpler way?

Thanks a lot !

Answer 1:

Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.

Preparing the data

You have already shown your efforts to bring the data into a suitable form, but let me propose an alternative.

First, I have to make sure that the Category column of the data frame is a character vector and not a factor. I also store a vector of all the categories that appear in the data frame:

DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))

I then need to summarise the data. For this purpose, I need a function that gives, for each value of Comp, the share of each category, scaled such that the values sum to the number of rows in the original data with that Comp.

The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).

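cat_perc <- function(cats, vec) {
  # split every entry into its individual categories
  entries <- unlist(strsplit(vec, ",", fixed = TRUE))
  # count each category by exact match, so that "1" would not also match "10"
  nums <- sapply(cats, function(cat) sum(entries == cat))
  # share of each category among all category mentions
  perc <- nums / sum(nums)
  # rescale so that the values sum to the number of rows in vec
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}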

Running the function on the complete data frame gives:

cat_perc(cats, DF$Category)
##          1         4        2        3
## 1 2.666667 0.6666667 1.333333 1.333333

The values sum up to six, which is indeed the total number of rows in the original data frame.
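The arithmetic behind these values can be checked by hand. The following is a minimal base R sketch, assuming the six Category entries from the question; the names categories, counts and scaled are only illustrative:

```r
# Category entries for all six rows of the example data
categories <- c("1", "1", "4,2", "1,3", "1,2", "3")

# Count each individual category after splitting on commas
counts <- table(unlist(strsplit(categories, ",", fixed = TRUE)))
counts

# Turn the counts into shares and scale them to the number of rows
scaled <- counts / sum(counts) * length(categories)
scaled

# The scaled values add up to the number of rows
sum(scaled)
```

Category 1 is mentioned 4 times out of 9 mentions in total, so its scaled value is 4/9 * 6. Note that table() sorts the categories, so the column order differs from the output above.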

Now we want to run that function for each value of Comp, which can be done using the dplyr package:

library(dplyr)
plot_data <-
group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
## 
##     Comp        1         4         2         3
##   (fctr)    (dbl)     (dbl)     (dbl)     (dbl)
## 1  Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2  Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3  Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4  Comp4 0.500000 0.0000000 0.0000000 0.5000000

This first groups the data by Comp and then applies the function cat_perc to only the subset of the data frame with a given Comp.
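If dplyr is not available, the same split-and-apply step can be sketched in base R. This is only an illustration: it rebuilds the example data by hand and uses a compact variant of cat_perc that returns a named vector instead of a data frame; by_comp and plot_matrix are hypothetical names.

```r
# Example data from the question (Category kept as character)
DF <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4", "Comp1", "Comp3"),
  Category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)
cats <- unique(unlist(strsplit(DF$Category, ",", fixed = TRUE)))

# Compact variant of cat_perc(): category shares scaled to length(vec)
cat_perc <- function(cats, vec) {
  entries <- unlist(strsplit(vec, ",", fixed = TRUE))
  nums <- sapply(cats, function(cat) sum(entries == cat))
  nums / sum(nums) * length(vec)
}

# Split Category by Comp, apply the function to each piece,
# and collect the results with one row per composition
by_comp <- split(DF$Category, DF$Comp)
plot_matrix <- t(sapply(by_comp, function(vec) cat_perc(cats, vec)))
plot_matrix

# Each row sums to the number of occurrences of that composition
rowSums(plot_matrix)
```

The result is a plain matrix rather than a grouped data frame, so it would need to be converted before being passed to the tidyr step.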

I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:

library(tidyr)
plot_data <-  gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
## 
##     Comp Category    value
##   (fctr)    (chr)    (dbl)
## 1  Comp1        1 1.333333
## 2  Comp2        1 1.000000
## 3  Comp3        1 0.000000
## 4  Comp4        1 0.500000
## 5  Comp1        4 0.000000
## 6  Comp2        4 0.000000

As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.

Plotting the data

Now that everything is ready, we can plot the data using ggplot:

library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
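For comparison, the weighting suggested in the question (dividing by the number of commas + 1) can be sketched in base R. This is a different scheme from cat_perc above, not a replacement for it: each row contributes a total weight of 1, split evenly among its categories. The names parts, n_parts, long and weighted are only illustrative.

```r
# Example data from the question (Category kept as character)
DF <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4", "Comp1", "Comp3"),
  Category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)

# Split each Category entry into its individual categories
parts <- strsplit(DF$Category, ",", fixed = TRUE)
n_parts <- lengths(parts)  # number of commas + 1 for each row

# One row per (Comp, category) pair, weighted by 1 / (number of commas + 1)
long <- data.frame(
  Comp = rep(DF$Comp, n_parts),
  Category = unlist(parts),
  weight = rep(1 / n_parts, n_parts)
)

# Sum the weights: rows are categories, columns are compositions
weighted <- xtabs(weight ~ Category + Comp, data = long)
weighted

# Stacked bars: the height of each bar is the number of occurrences
barplot(weighted, legend.text = TRUE, xlab = "Comp", ylab = "Occurrences")
```

With this weighting, Comp1 gets 1.5 for category 1 and 0.5 for category 2, whereas cat_perc gives 1.33 and 0.67. Both sum to 2, the number of occurrences of Comp1; which split is preferable depends on how a multi-category row should be interpreted.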