Can't get aggregate() to work for regression by g

Posted 2019-01-29 00:26

Question:

I want to use aggregate with this custom function:

# linear regression function: range of the fitted values
CalculateLinRegrDiff <- function(sample) {
  fit <- lm(value ~ date, data = sample)
  diff(range(fitted(fit)))
}

dataset2 = aggregate(value ~ id + col, dataset, CalculateLinRegrDiff(dataset))

I receive the error:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'FUN' of mode 'function' was not found

What is wrong?

Answer 1:

First, your call to aggregate is wrong. Pass the function CalculateLinRegrDiff itself to the FUN argument, not the result of calling it, CalculateLinRegrDiff(dataset).
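
To make the difference concrete, here is a minimal sketch that uses mean on a small made-up data frame (toy and its columns are assumptions for illustration, not your data):

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                  id    = c("a", "a", "b", "b", "c", "c"))

# Correct: FUN receives the function object itself
res <- aggregate(value ~ id, data = toy, FUN = mean)
res$value

# Wrong: mean(toy$value) is evaluated first, so FUN receives a number,
# and aggregate() stops with an error (captured here with try())
bad <- try(aggregate(value ~ id, data = toy, FUN = mean(toy$value)),
           silent = TRUE)
inherits(bad, "try-error")
```

The second call fails for the same reason as the call in the question: what reaches FUN is a value, not a function.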

Secondly, you've chosen the wrong tool. aggregate can't help you fit a regression by group. It splits the vector on the LHS of ~ according to the combinations of the variables on the RHS, and then applies FUN to each piece of the LHS vector. That is, FUN should be a function that works on an atomic vector, not on a data frame. For example, mean, sd and quantile all take an atomic vector as input. CalculateLinRegrDiff expects a data frame, so it is not going to work with aggregate.
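
A quick way to see what FUN actually receives is to pass a function that only inspects its argument. This is a sketch on made-up data (toy is a hypothetical data frame):

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(value = c(1, 2, 3, 4),
                  date  = c(10, 20, 30, 40),
                  id    = c("a", "a", "b", "b"))

# FUN sees only the pieces of `value`; the `date` column never reaches it
seen <- aggregate(value ~ id, data = toy,
                  FUN = function(x) is.atomic(x) && !is.data.frame(x))
seen$value   # one logical per group: was the input a plain atomic vector?
```

Because date is never handed to FUN, a function that needs both value and date cannot be used this way.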

Note that sometimes we use cbind on the LHS, like cbind(x, y) ~ f. This means that FUN is applied separately to x ~ f and to y ~ f. The LHS variables are handled independently and are never used together.
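
As a small sketch of this (toy, x, y and f below are made-up names), the cbind form should give the same numbers as two separate calls:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(10, 20, 30, 40),
                  f = c("a", "a", "b", "b"))

# One call with two LHS variables
both <- aggregate(cbind(x, y) ~ f, data = toy, FUN = mean)

# Two separate calls, one per LHS variable
only_x <- aggregate(x ~ f, data = toy, FUN = mean)
only_y <- aggregate(y ~ f, data = toy, FUN = mean)

# The columns agree, so FUN never saw x and y together
all.equal(both$x, only_x$x)
all.equal(both$y, only_y$y)
```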

The right tool for you is the by function. It splits a data frame into sub data frames and applies FUN on each sub frame. So it is ideal for regression by group.

by(dataset[c("value", "date")], dataset[c("id", "col")], CalculateLinRegrDiff)

A simple reproducible example:

set.seed(0)
dataset <- data.frame(value = runif(20), date = runif(20),
                      f = sample(gl(2, 10)), g = sample(gl(4, 5)))
oo <- by(dataset[c("value", "date")], dataset[c("f", "g")], CalculateLinRegrDiff)
str(oo)
# by [1:2, 1:4] 0.307 0.251 0.109 0.201 0.472 ...
# - attr(*, "dimnames")=List of 2
#  ..$ f: chr [1:2] "1" "2"
#  ..$ g: chr [1:4] "1" "2" "3" "4"

Since CalculateLinRegrDiff returns a single number, by simplifies the result oo to an array rather than a list. This array is like a contingency table, so we can use the "table" method of as.data.frame to reshape it into a data frame:

oo <- as.data.frame.table(oo)
#  f g      Freq
#1 1 1 0.3069877
#2 2 1 0.2508591
#3 1 2 0.1087895
#4 2 2 0.2007295
#5 1 3 0.4715680
#6 2 3 0.4942069
#7 1 4 0.3223174
#8 2 4 0.4687340

The name "Freq" may not be what you want, but it is easy to change, for example with names(oo)[3] <- "foo".
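
Alternatively, as.data.frame.table has a responseName argument that sets the column name directly. Below is a minimal sketch on a small deterministic data set (the data and the name "fitted_range" are made up for illustration; every f and g combination has four rows so that lm can be fitted in each cell):

```r
CalculateLinRegrDiff <- function(sample) {
  fit <- lm(value ~ date, data = sample)
  diff(range(fitted(fit)))
}

# Hypothetical data: 2 x 2 groups, 4 rows each
dataset <- data.frame(value = (1:16)^2,
                      date  = 1:16,
                      f = rep(c("1", "2"), each = 8),
                      g = rep(rep(c("1", "2"), each = 4), times = 2))

oo <- by(dataset[c("value", "date")], dataset[c("f", "g")], CalculateLinRegrDiff)

# Name the result column while reshaping, instead of renaming "Freq" afterwards
res <- as.data.frame.table(oo, responseName = "fitted_range")
names(res)
```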

As I said in my comments under your question, we can also use split and lapply, but then there is no trivial way to convert the result into a good-looking data frame.

datlist <- split(dataset[c("value", "date")], dataset[c("f", "g")], drop = TRUE)
rr <- lapply(datlist, CalculateLinRegrDiff)
stack(rr)
#     values ind
#1 0.3069877 1.1
#2 0.2508591 2.1
#3 0.1087895 1.2
#4 0.2007295 2.2
#5 0.4715680 1.3
#6 0.4942069 2.3
#7 0.3223174 1.4
#8 0.4687340 2.4
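
If you do go the split and lapply route, one possible workaround is to break the combined labels in ind apart again. This is only a sketch on made-up data, and it assumes that none of the group labels contains a "." themselves:

```r
CalculateLinRegrDiff <- function(sample) {
  fit <- lm(value ~ date, data = sample)
  diff(range(fitted(fit)))
}

# Hypothetical data: 2 x 2 groups, 4 rows each
dataset <- data.frame(value = (1:16)^2,
                      date  = 1:16,
                      f = rep(c("1", "2"), each = 8),
                      g = rep(rep(c("1", "2"), each = 4), times = 2))

datlist <- split(dataset[c("value", "date")], dataset[c("f", "g")], drop = TRUE)
rr <- lapply(datlist, CalculateLinRegrDiff)
st <- stack(rr)

# Labels look like "1.2" (f first, then g); split them on the literal dot
keys <- do.call(rbind, strsplit(as.character(st$ind), ".", fixed = TRUE))
res <- data.frame(f = keys[, 1], g = keys[, 2], diff = st$values)
res
```

This is more fragile than the by approach, because it relies on the label format that split produces.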

I suggest you read "Linear Regression and group by in R" for a thorough demonstration of regression by group.