I am trying to replicate a table often used in official statistics, but have had no success so far. Given a data frame like this one:
d1 <- data.frame(
  StudentID     = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c('F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'),
  ExamenYear    = c('2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'),
  Exam          = c('algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'),
  participated  = c('no','yes','yes','yes','no','yes','yes','yes','yes','yes'),
  passed        = c('no','yes','yes','yes','no','yes','yes','yes','no','yes'),
  stringsAsFactors = FALSE)
I would like to create a table showing, per year, the number of all students (All) and, of those, how many are female, how many participated and how many passed. Please note that "ofwhich" below refers to all students.
The table I have in mind would look like this:
cbind(All           = table(d1$ExamenYear),
      participated  = table(d1$ExamenYear, d1$participated)[,2],
      ofwhichFemale = table(d1$ExamenYear, d1$StudentGender)[,1],
      ofwhichpassed = table(d1$ExamenYear, d1$passed)[,2])
I am sure there is a better way to do this kind of thing in R.
Note: I have seen LaTeX solutions, but I am not sure this will work for me, as I need to export the table to Excel.
Thanks in advance
Using plyr:
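A minimal sketch of what such a plyr call might look like, assuming the d1 data frame from the question. The output column names are borrowed from the question's cbind call, and the counting expressions are an assumption rather than the answerer's original code:

```r
library(plyr)

# Data frame from the question, repeated so the example is self-contained.
d1 <- data.frame(
  StudentID     = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear    = c("2007", "2007", "2007", "2008", "2008", "2008", "2008",
                    "2009", "2009", "2009"),
  Exam          = c("algebra", "stats", "bio", "algebra", "algebra", "stats",
                    "stats", "algebra", "bio", "bio"),
  participated  = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes"),
  passed        = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE)

# Split d1 by ExamenYear and compute one row of counts per year.
# Columns are referred to by bare name, without d1$.
res <- ddply(d1, .(ExamenYear), summarise,
             All           = length(StudentID),
             participated  = sum(participated == "yes"),
             ofwhichFemale = sum(StudentGender == "F"),
             ofwhichpassed = sum(passed == "yes"))

print(res)
#   ExamenYear All participated ofwhichFemale ofwhichpassed
# 1       2007   3            2             2             2
# 2       2008   4            3             2             3
# 3       2009   3            3             0             2
```

Counting with sum() over a logical comparison avoids relying on the column order of table(), which is what the [,1] and [,2] indices in the question depend on.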
The plyr package is great for this sort of thing. First load the package. Then we use the ddply function. Basically, ddply expects a data frame as input and returns a data frame. We split up the input data frame by ExamenYear, and on each sub-table we calculate a few summary statistics. Notice that in ddply, we don't have to use the $ notation when referring to columns.

You may also want to take a look at plyr's next iteration: dplyr. It uses a ggplot-like syntax and provides fast performance by writing key pieces in C++.
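As an illustration of the dplyr route, here is a minimal sketch assuming the d1 data frame from the question. The pipeline and the output column names are my own assumption of how the same table could be built, not code from the original answer:

```r
library(dplyr)

# Data frame from the question, repeated so the example is self-contained.
d1 <- data.frame(
  StudentID     = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear    = c("2007", "2007", "2007", "2008", "2008", "2008", "2008",
                    "2009", "2009", "2009"),
  Exam          = c("algebra", "stats", "bio", "algebra", "algebra", "stats",
                    "stats", "algebra", "bio", "bio"),
  participated  = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes"),
  passed        = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE)

# Group by year, then reduce each group to a single row of counts.
res <- d1 %>%
  group_by(ExamenYear) %>%
  summarise(All           = n(),
            participated  = sum(participated == "yes"),
            ofwhichFemale = sum(StudentGender == "F"),
            ofwhichpassed = sum(passed == "yes"))

print(res)
```

The result is a tibble with one row per ExamenYear; as.data.frame(res) converts it to a plain data frame if that is needed for export.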
A couple of modifications to your own code would have made it easier to read and a worthy competitor to the ddply solutions: use with to reduce the number of df$ calls, and use character indices to improve self-documentation. I would expect the modified base R version to be much faster than the ddply solution, although that will only be apparent if you are working on larger datasets.
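The two modifications can be sketched as follows, assuming the d1 data frame from the question. This is a reconstruction of the idea rather than the answerer's exact code:

```r
# Data frame from the question, repeated so the example is self-contained.
d1 <- data.frame(
  StudentID     = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear    = c("2007", "2007", "2007", "2008", "2008", "2008", "2008",
                    "2009", "2009", "2009"),
  Exam          = c("algebra", "stats", "bio", "algebra", "algebra", "stats",
                    "stats", "algebra", "bio", "bio"),
  participated  = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes"),
  passed        = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE)

# with() evaluates the expression inside d1, so no d1$ prefixes are needed.
# Character indices ("yes", "F") state which column of each table is wanted,
# instead of relying on its position.
res <- with(d1, cbind(
  All           = table(ExamenYear),
  participated  = table(ExamenYear, participated)[, "yes"],
  ofwhichFemale = table(ExamenYear, StudentGender)[, "F"],
  ofwhichpassed = table(ExamenYear, passed)[, "yes"]))

print(res)
#      All participated ofwhichFemale ofwhichpassed
# 2007   3            2             2             2
# 2008   4            3             2             3
# 2009   3            3             0             2
```

The result is a matrix with the years as row names. For the Excel export mentioned in the question, one option is write.csv(res, "exam_summary.csv") (the file name is hypothetical), which produces a CSV file that Excel can open.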