I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:
DF <- read.table(text = " Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3 ", header = TRUE)
barplot(table(DF$Comp))
This worked perfectly for me.
After that, since a composition can belong to one or more categories, the categories are separated by commas. I want a bar plot with the compositions on the x axis and the number of occurrences on the y axis, and within each bar the percentage of each category.
My idea was to duplicate each row that contains a comma, i.e. to repeat it N+1 times, where N is the number of commas.
DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
For Comp1, for example, this gives:
1 2 3 4
Comp1 2 1 0 0
But if I apply this method, the total over the categories (3) no longer matches the number of occurrences of the composition (Comp1 = 2).
How should I proceed in this case? Is the solution to divide by the number of commas + 1? If so, how can I do this in my code, and is there a simpler way?
Thanks a lot!
Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.
Preparing the data
You have already shown your efforts to bring the data into a suitable form, but let me propose an alternative way.
First, I have to make sure that the Category column of the data frame is a character and not a factor. I also store a vector of all the categories that appear in the data frame:
DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))
I then need to summarise the data. For this purpose, I need a function that gives, for each value in Comp, the share of each category, scaled such that the values sum to the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).
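cat_perc <- function(cats, vec) {
  # split each entry into its categories, so that "1" does not also match "11"
  parts <- strsplit(vec, ",", fixed = TRUE)
  # number of rows in which each category appears
  nums <- sapply(cats, function(cat) sum(sapply(parts, function(p) cat %in% p)))
  # share of each category, scaled so that the values sum to the number of rows
  perc <- nums / sum(nums)
  final <- perc * length(vec)
  # one-row data frame; reset the names because as.data.frame() mangles "1" to "X1"
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}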
Running the function on the complete data frame gives:
cat_perc(cats, DF$Category)
## 1 4 2 3
## 1 2.666667 0.6666667 1.333333 1.333333
The values sum up to six, which is indeed the total number of rows in the original data frame.
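To see where such numbers come from, the scaling can be traced by hand for a single composition. The sketch below repeats the same arithmetic for the two Comp1 rows only; the vector comp1 and the other variable names are illustrative and not part of the original code.

```r
# Category entries of the two rows with Comp == "Comp1" (copied from the data)
comp1 <- c("1", "1,2")
cats <- c("1", "4", "2", "3")

# split each entry into its individual categories
parts <- strsplit(comp1, ",", fixed = TRUE)

# number of rows in which each category appears
nums <- sapply(cats, function(ct) sum(sapply(parts, function(p) ct %in% p)))

# shares of the categories, rescaled so that they sum to the number of rows
scaled <- nums / sum(nums) * length(comp1)
scaled
sum(scaled)
```

Category 1 appears in both rows and category 2 in one, so the three mentions are split 2:1 and rescaled to a total of two rows, giving 4/3 for category 1 and 2/3 for category 2.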
Now we want to run that function for each value of Comp, which can be done using the dplyr package:
library(dplyr)
plot_data <-
  group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
##
## Comp 1 4 2 3
## (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2 Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3 Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4 Comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by Comp and then applies the function cat_perc to only the subset of the data frame with a given Comp.
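The same split-apply-combine step can also be sketched in base R, which may be handy if dplyr is not available. This is an illustrative alternative, not the approach used in this answer: the helper cat_share is a hypothetical name, it returns a plain named vector instead of a data frame, and the data are re-entered by hand so that the block runs on its own.

```r
# the example data, typed in directly
DF <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4", "Comp1", "Comp3"),
  Category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)
cats <- unique(unlist(strsplit(DF$Category, ",", fixed = TRUE)))

# hypothetical helper: category shares scaled to the number of entries in vec
cat_share <- function(vec, cats) {
  parts <- strsplit(vec, ",", fixed = TRUE)
  nums <- sapply(cats, function(ct) sum(sapply(parts, function(p) ct %in% p)))
  nums / sum(nums) * length(vec)
}

# split the Category column by Comp, apply the helper to each piece,
# and combine the results into a matrix with one row per composition
by_comp <- split(DF$Category, DF$Comp)
wide <- t(sapply(by_comp, cat_share, cats = cats))
wide

# each row sums to the number of rows with that Comp in DF
rowSums(wide)
```

The result is a numeric matrix with the compositions as row names and the categories as column names, rather than a grouped data frame.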
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:
library(tidyr)
plot_data <- gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
##
## Comp Category value
## (fctr) (chr) (dbl)
## 1 Comp1 1 1.333333
## 2 Comp2 1 1.000000
## 3 Comp3 1 0.000000
## 4 Comp4 1 0.500000
## 5 Comp1 4 0.000000
## 6 Comp2 4 0.000000
As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
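What "long format" means can also be illustrated without tidyr. The sketch below is a base-R illustration under the assumption that the summarised values are held in a numeric matrix; the name wide is hypothetical, and the matrix is filled by hand with the values printed earlier rather than computed.

```r
# summarised values, one row per composition and one column per category
wide <- matrix(
  c(4/3, 0,   2/3, 0,
    1,   0,   0,   0,
    0,   2/3, 2/3, 2/3,
    0.5, 0,   0,   0.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(Comp = paste0("Comp", 1:4),
                  Category = c("1", "4", "2", "3"))
)

# treat the matrix as a two-way table and unfold it:
# one row per (Comp, Category) pair, with the cell content in "value"
long <- as.data.frame(as.table(wide), responseName = "value",
                      stringsAsFactors = FALSE)
head(long)
nrow(long)
```

The 4 x 4 matrix becomes 16 rows, one for each combination of composition and category.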
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
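The question also asked whether each row could instead be divided by its number of categories (number of commas plus one). A minimal sketch of that alternative weighting in base R follows. It is not what cat_perc does, and the names expanded, weight and weighted are illustrative. Each row contributes a total weight of one, so the bar heights still equal the number of rows per composition, but the split within a bar can differ (for Comp1 it gives 1.5 and 0.5 rather than 1.33 and 0.67).

```r
# the example data, typed in directly
DF <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4", "Comp1", "Comp3"),
  Category = c("1", "1", "4,2", "1,3", "1,2", "3"),
  stringsAsFactors = FALSE
)

# categories of each row and how many there are (commas + 1)
parts <- strsplit(DF$Category, ",", fixed = TRUE)
n_cat <- lengths(parts)

# repeat each row once per category and give every copy the weight 1 / n_cat
expanded <- data.frame(
  Comp = rep(DF$Comp, times = n_cat),
  Category = unlist(parts),
  weight = rep(1 / n_cat, times = n_cat),
  stringsAsFactors = FALSE
)

# total weight per composition and category
weighted <- aggregate(weight ~ Comp + Category, data = expanded, FUN = sum)
weighted

# the weights of each composition add up to its number of rows in DF
totals <- tapply(weighted$weight, weighted$Comp, sum)
totals
```

The data frame weighted is already in long format, so it could be passed to the same ggplot call with y = weight instead of y = value.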