I have been trying to use biglm to run linear regressions on a large dataset (approximately 60,000,000 rows). I want to use AIC for model selection. However, I discovered when playing with biglm on smaller datasets that the AIC values returned by biglm are different from those returned by lm. This even applies to the example in the biglm help.
data(trees)
ff<-log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
AIC(a)     # 48.18546
a_lm <- lm(ff, trees)
AIC(a_lm)  # -62.71125
Can someone please explain what is happening here? Are the AICs generated with biglm safe to use for comparing biglm models on the same dataset?
I have played around with this a bit. I am not certain, but I think the formula for AIC used by the package biglm is:

2 * (n.parameters + obs.added - 1) + deviance(a)

where obs.added is the number of observations in chunk2 plus the number of observations in chunk3:

obs.added <- nrow(chunk2) + nrow(chunk3)

n.parameters is the number of estimated coefficients returned by summary(a), plus 1 (the +1 is for the error term), and deviance(a) is the deviance of your model a.

Since I am not 100% certain my answer is correct, you can play around with the code below and see whether you can find a scenario where the proposed formula for AIC does not work. If I find any such scenario I will attempt to modify the code and the formula above as necessary.
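The original test script is not reproduced here; the following is a minimal sketch of such a check (my own code, continuing from the objects a, chunk2 and chunk3 defined in the question; obs.added, n.parameters and proposed are names I introduce), comparing the proposed formula with AIC(a) on the trees example:

## continuing from the question's code: a, chunk2, chunk3 already exist
obs.added    <- nrow(chunk2) + nrow(chunk3)   # 10 + 11 = 21
n.parameters <- length(coef(a)) + 1           # 3 coefficients + 1 for the error term
proposed     <- 2 * (n.parameters + obs.added - 1) + deviance(a)

proposed   # should match AIC(a) (48.18546 in the question)
AIC(a)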
EDIT
I suggested biglm uses the following equation for AIC:

2 * (n.parameters + obs.added - 1) + deviance(a)

Ben Bolker pointed out that the equation biglm uses for AIC is:

deviance(object) + k * (object$n - object$df.resid)

(with k = 2 by default). Ben also determined that biglm was not updating the first value for the residual df. Given that new information, I now see that the two equations are equivalent.
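As a quick numerical check (again my own sketch, reusing the objects from the question and the obs.added and n.parameters quantities defined above), the two expressions give the same number on the trees example:

mine <- 2 * (n.parameters + obs.added - 1) + deviance(a)
bens <- deviance(a) + 2 * (a$n - a$df.resid)
c(mine = mine, bens = bens, AIC = AIC(a))   # all three should agree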
First, restrict the two equations to the following, which is the only place they differ:

(n.parameters + obs.added - 1)                            # mine
(object$n - object$df.resid)                              # Ben's

Re-arrange mine to account for me adding 1 to the number of parameters and then subtracting one:

((n.parameters - 1) + obs.added)                          # mine

Now morph my equation into Ben's. My (n.parameters - 1) is 3, the number of estimated coefficients, and because df.resid was never updated after chunk1, that 3 is the same as:

(number of observations in chunk1 - object$df.resid)

giving:

((number of observations in chunk1 - object$df.resid) + obs.added)

Re-arrange:

((number of observations in chunk1 + obs.added) - object$df.resid)

And:

((number of observations in chunk1 + obs.added) - object$df.resid)

is the same as:

(total number of observations - object$df.resid)

which is the same as:

(object$n - object$df.resid)                              # Ben's
It appears the equation I proposed really is the equation biglm uses for AIC, just expressed in a different form. Of course, I was only able to realize this because Ben provided both the critical code and the critical explanation of the error.
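Plugging in the concrete numbers from the trees example makes the equivalence easy to see (3 estimated coefficients, 10 rows in chunk1, 21 rows added, and df.resid stuck at 10 - 3 = 7; the variable names below are mine):

n.parameters <- 3 + 1     # coefficients plus the error term
obs.added    <- 21        # rows in chunk2 plus rows in chunk3
df.resid     <- 10 - 3    # never updated after chunk1, so 7
n            <- 31        # total rows, which is updated correctly

n.parameters + obs.added - 1   # my count: 24
n - df.resid                   # Ben's count: 24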
tl;dr it looks to me like there is a pretty obvious bug in the AIC method for biglm-class objects (more specifically, in the update method) in the current (0.9-1) version, but the author of the biglm package is a smart, experienced guy, and biglm is widely used, so perhaps I'm missing something. Googling for "biglm AIC df.resid", it seems this may have been discussed as far back as 2009. Update: the package author/maintainer reports via e-mail that this is indeed a bug.

Something funny seems to be going on here. The differences in AIC between models should be the same across modeling frameworks, whatever constants have been used and however parameters are counted, because these constants and parameter-counting conventions should be consistent within a given modeling framework.
Original example:
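The example code is not reproduced in the text above; it is the same setup as in the question, repeated here so the rest of the answer is self-contained:

library(biglm)
data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)

chunk1 <- trees[1:10, ]
chunk2 <- trees[11:20, ]
chunk3 <- trees[21:31, ]

## lm fit on the full data
a_lm <- lm(ff, trees)

## biglm fit built up chunk by chunk
a <- biglm(ff, chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)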
Now fit a reduced model:
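The reduced model is not shown either; dropping log(Height) is my choice here (any nested submodel illustrates the point), and ff2, a_lm2 and a2 are names I introduce:

ff2 <- log(Volume) ~ log(Girth)

a_lm2 <- lm(ff2, trees)

a2 <- biglm(ff2, chunk1)
a2 <- update(a2, chunk2)
a2 <- update(a2, chunk3)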
Now compare AIC values:
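Something along these lines compares the full-versus-reduced AIC difference under each framework; per the answer, the two differences do not come out the same, which is the symptom being investigated:

AIC(a_lm) - AIC(a_lm2)   # difference according to lm
AIC(a) - AIC(a2)         # difference according to biglm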
Check that we haven't screwed something up:
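A check like the following (my sketch) confirms the two frameworks agree on the fitted coefficients, so the discrepancy is in the AIC computation rather than in the fitting:

all.equal(coef(a_lm),  coef(a))    # expect TRUE
all.equal(coef(a_lm2), coef(a2))   # expect TRUE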
Look under the hood:
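Printing the method shows how biglm computes AIC; in the 0.9-1 version under discussion it is essentially:

biglm:::AIC.biglm
# function (object, ..., k = 2)
#     deviance(object) + k * (object$n - object$df.resid)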
In principle this is the right formula (the number of observations minus the residual df should be the number of parameters fitted), but digging in, it looks like the $df.resid component of the objects hasn't been updated properly: it still holds the value computed from chunk1 alone, while object$n has been updated correctly. Looking at biglm:::update.biglm, I would add a line that increments object$df.resid by the number of newly added rows, right before or after the existing line that increments object$n ...
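For the trees example, inspecting the two components the AIC method uses shows the problem (values consistent with AIC(a) = deviance(a) + 2 * (31 - 7)):

a$n          # 31: total number of observations, updated at each call to update()
a$df.resid   # 7: still 10 - 3, the residual df from chunk1 only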
This seems like a fairly obvious bug to me, so perhaps I'm missing something obvious, or perhaps it has been fixed.
A simple workaround would be to define your own AIC function that does not rely on the stale df.resid component.
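The replacement function from the answer is not shown above; a minimal sketch (my own, counting parameters from the coefficient vector instead of df.resid; AIC_biglm_fixed is a name I introduce) would be:

## deviance-based AIC for biglm fits that ignores the stale df.resid component
AIC_biglm_fixed <- function(object, k = 2) {
  deviance(object) + k * length(coef(object))
}

AIC_biglm_fixed(a) - AIC_biglm_fixed(a2)   # compares biglm fits using a consistent parameter count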
(Although note that AIC(a_lm) is still not equal to AIC(a), because stats:::AIC.default() uses -2 * log-likelihood rather than the deviance; the two measures differ in their additive constants ...)