I am relatively new to R and am trying to use ddply and summarise from the plyr package. This post almost, but not quite, answers my question. I could use some additional explanation and clarification.
My problem:
I want to create a simple function to summarize descriptive statistics, by group, for a given variable. Unlike the linked post, I would like to include the variable of interest as an argument to the function. As has already been discussed on this site, this works:
library(plyr)
ddply(mtcars, ~ cyl, summarise,
      mean = mean(hp),
      sd = sd(hp),
      min = min(hp),
      max = max(hp)
)
But this doesn't:
descriptives_by_group <- function(dataset, group, x)
{
  ddply(dataset, ~ group, summarise,
        mean = mean(x),
        sd = sd(x),
        min = min(x),
        max = max(x)
  )
}
descriptives_by_group(mtcars, cyl, hp)
Because of the volume of data I am working with, I would like a function that lets me specify the variable of interest as well as the dataset and the grouping variable.
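To make the goal concrete, here is a minimal sketch of one way such a function might look. It assumes the grouping variable and the variable of interest are passed as quoted column names ("cyl", "hp") rather than bare names, and it replaces summarise with an anonymous function; both choices are assumptions for illustration, not something taken from the linked post.

```r
library(plyr)

# Hypothetical variant: `group` and `x` are column names given as strings.
descriptives_by_group <- function(dataset, group, x) {
  ddply(dataset, group, function(piece) {
    values <- piece[[x]]  # the column of interest within this group
    data.frame(
      mean = mean(values),
      sd   = sd(values),
      min  = min(values),
      max  = max(values)
    )
  })
}

result <- descriptives_by_group(mtcars, "cyl", "hp")
```

Passing strings sidesteps the question of how bare names like cyl and hp are looked up inside the function, at the cost of a slightly different calling convention.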
I have tried to edit the various solutions found here to address my problem, but I don't understand the code well enough to do it successfully.
The original poster used the following example dataset:
a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")
With the desired output:
  b Ave
1 0 1.5
2 1 3.5
And the solution endorsed by Hadley was:
myFunction <- function(x, y) {
  NewColName <- "a"
  z <- ddply(x, y, .fun = function(xx, col) {
    c(Ave = mean(xx[, col], na.rm = TRUE))
  }, NewColName)
  return(z)
}
where myFunction(df, sv) returns the desired output.
I tried to break down the code piece by piece to see if, by getting a better understanding of the underlying mechanics, I could modify the code to include an argument to the function that would pass to what, in this example, is "NewColName" (the variable you want to get information about). But I am not having any success. My difficulty is that I do not understand what is happening with (xx[,col]). I know that mean(xx[,col]) should be taking the mean of the column with index col for the data frame xx. But I don't understand where the anonymous function is reading those values from.
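A minimal sketch of what seems to be going on, based on the documented behaviour of ddply: it splits the data frame by the grouping variable, calls .fun once per piece with that piece as the first argument (xx), and forwards any extra arguments to .fun (here "a" becomes col). The names pieces, fun, manual and via_ddply are made up for this illustration.

```r
library(plyr)

df <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1), c = c(5, 6, 7, 8))

# The same anonymous function as in the endorsed solution, given a name.
fun <- function(xx, col) c(Ave = mean(xx[, col], na.rm = TRUE))

# Done by hand: one data frame per value of b, then fun on each piece.
pieces <- split(df, df$b)
manual <- lapply(pieces, fun, "a")   # "a" is matched to `col`

# Done by ddply: the trailing "a" is passed through to fun as `col`.
via_ddply <- ddply(df, "b", fun, "a")
```

Under this reading, xx is never defined by the caller; it is whichever subset of the data frame ddply is currently processing, and col is the extra argument supplied after .fun.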
Could someone please help me parse this? I've wasted hours on a trivial task I could accomplish easily with very repetitive code and/or with subsetting, but I got hung up on trying to make my script simpler and more elegant, and on understanding the "whys" of this problem and its solution(s).
PS I have looked into the describeBy function from the psych package, but as far as I can tell, it does not let you specify the variable(s) you want to return values for, and consequently does not solve my problem.