I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:
DF <- read.table(text = " Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3 ", header = TRUE)
barplot(table(DF$Comp))
This worked perfectly for me.
Next, since a composition can belong to one or more categories, the categories are separated by commas. I want a bar plot with the compositions on the x-axis and the number of occurrences on the y-axis, where each bar shows the percentage of each category.
My idea was to duplicate each line that contains a comma, repeating it N+1 times, where N is the number of commas.
DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
It gives me, for example, for Comp1:
      1 2 3 4
Comp1 2 1 0 0
But if I apply this method, the total over the categories (3) no longer matches the number of occurrences of the composition (Comp1 = 2).
How should I proceed in this case? Is the solution to divide by the number of commas + 1? If so, how do I do this in my code, and is there a simpler way?
Thanks a lot !
Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.
Preparing the data
You have already shown your efforts of bringing the data to a suitable form, but let me propose an alternative way.
First, I have to make sure that the Category column of the data frame is a character and not a factor. I also store a vector of all the categories that appear in the data frame.
I then need to summarise the data. For this purpose, I need a function that gives, for each value in Comp, the percentage for each category, scaled such that the sum of the values gives the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later). Running the function on the complete data frame gives:
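A minimal sketch of these preparation steps, assuming the example data from the question. The name cat_perc comes from the text below; its body, its argument names and the variable name cats are assumptions made for illustration. Each row contributes a total weight of 1, split evenly among the categories it lists.

```r
# Recreate the example data from the question.
DF <- read.table(text = "Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3", header = TRUE, stringsAsFactors = FALSE)

# Make sure Category is a character vector, then collect all categories.
DF$Category <- as.character(DF$Category)
cats <- sort(unique(unlist(strsplit(DF$Category, ",", fixed = TRUE))))

# Assumed implementation of cat_perc(): every row adds 1 in total,
# shared equally among its categories (a row is assumed not to list
# the same category twice). Returns a one-row data frame.
cat_perc <- function(df, cats) {
  totals <- setNames(numeric(length(cats)), cats)
  for (row_cats in strsplit(df$Category, ",", fixed = TRUE)) {
    totals[row_cats] <- totals[row_cats] + 1 / length(row_cats)
  }
  as.data.frame(t(totals))
}

# One row: category 1 = 3, 2 = 1, 3 = 1.5, 4 = 0.5
cat_perc(DF, cats)
```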
The values sum up to six, which is indeed the total number of rows in the original data frame.
Now we want to run that function for each value of Comp, which can be done using the dplyr package: the data are first grouped by Comp, and the function cat_perc is then applied to only the subset of the data frame with a given Comp.
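The grouping step can be sketched as follows, using group_by() and do() from dplyr. The setup is repeated so that the snippet runs on its own; the body of cat_perc and the name DF_summary are assumptions, not necessarily the original code.

```r
library(dplyr)

# Setup repeated so that the snippet runs on its own.
DF <- read.table(text = "Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3", header = TRUE, stringsAsFactors = FALSE)
cats <- sort(unique(unlist(strsplit(DF$Category, ",", fixed = TRUE))))

# Assumed implementation: each row adds 1, shared among its categories.
cat_perc <- function(df, cats) {
  totals <- setNames(numeric(length(cats)), cats)
  for (row_cats in strsplit(df$Category, ",", fixed = TRUE)) {
    totals[row_cats] <- totals[row_cats] + 1 / length(row_cats)
  }
  as.data.frame(t(totals))
}

# Group by Comp and apply cat_perc() to the rows of each group.
DF_summary <- DF %>%
  group_by(Comp) %>%
  do(cat_perc(., cats)) %>%
  ungroup()

DF_summary
```

For Comp1 this gives 1.5 for category 1 and 0.5 for category 2, which sum to 2, the number of rows with Comp1 in the original data.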
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package. After reshaping, there is a single data point per row, characterised by Comp, Category and the corresponding value.
Plotting the data
Now that everything is ready, we can plot the data using ggplot.
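The reshaping and plotting steps can be sketched as follows. This is only an illustration: it starts from the wide summary that the grouping step yields for the example data, written out by hand so the snippet runs on its own, and it uses pivot_longer() from tidyr and geom_col() from ggplot2, which may differ from the functions originally used. The names DF_summary and DF_long are assumptions.

```r
library(tidyr)
library(ggplot2)

# Wide summary for the example data: one row per Comp, one column per category.
DF_summary <- data.frame(
  Comp = c("Comp1", "Comp2", "Comp3", "Comp4"),
  `1` = c(1.5, 1, 0, 0.5),
  `2` = c(0.5, 0, 0.5, 0),
  `3` = c(0, 0, 1, 0.5),
  `4` = c(0, 0, 0.5, 0),
  check.names = FALSE
)

# Long format: one row per (Comp, Category) pair.
DF_long <- pivot_longer(DF_summary, cols = -Comp,
                        names_to = "Category", values_to = "value")

# Stacked bars: the total height of a bar is the number of rows for that
# Comp, and the fill shows how that count is shared among the categories.
p <- ggplot(DF_long, aes(x = Comp, y = value, fill = Category)) +
  geom_col()
print(p)
```

Because each row of the original data contributes a total of 1, the bar heights match the counts from barplot(table(DF$Comp)) in the question.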