I know that the support for linear models with multiple LHS is limited. But when it is possible to run a function on an "mlm" object, I would expect the results to be trusty. When using rstudent
, strange results are produced. Is this a bug or is there some other explanation?
In the example below fittedA
and fittedB
are identical, but in the case of rstudent
the 2nd column differs.
y <- matrix(rnorm(20), 10, 2)
x <- 1:10
fittedA <- fitted(lm(y ~ x))
fittedB <- cbind(fitted(lm(y[, 1] ~ x)), fitted(lm(y[, 2] ~ x)))
rstudentA <- rstudent(lm(y ~ x))
rstudentB <- cbind(rstudent(lm(y[, 1] ~ x)), rstudent(lm(y[, 2] ~ x)))
Setup
set.seed(0)
y <- matrix(rnorm(20), 10, 2)
x <- 1:10
fit <- lm(y ~ x) ## class: "mlm", "lm"
fit1 <- lm(y[, 1] ~ x) ## class: "lm"
fit2 <- lm(y[, 2] ~ x) ## class: "lm"
rstudent(fit)
# [,1] [,2]
#1 0.74417620 0.89121744
#2 -0.67506054 -0.50275275
#3 0.76297805 -0.74363941
#4 0.71164461 0.01075898
#5 0.03337192 0.03355209
#6 -1.75099724 -0.02701558
#7 -1.05594284 0.56993056
#8 -0.48486883 -0.35286612
#9 -0.23468552 0.79610101
#10 2.90701182 -0.93665406
cbind(rstudent(fit1), rstudent(fit2))
# [,1] [,2]
#1 0.74417620 1.90280959
#2 -0.67506054 -0.92973971
#3 0.76297805 -1.47237918
#4 0.71164461 0.01870820
#5 0.03337192 0.06042497
#6 -1.75099724 -0.04056992
#7 -1.05594284 1.02171222
#8 -0.48486883 -0.64316472
#9 -0.23468552 1.69605079
#10 2.90701182 -1.25676088
As you've observed, only the result for the first response is correctly returned by rstandard(fit)
.
Why rstudent
fails on "mlm"
The thing is, there is no "mlm" method for rstudent
.
methods(rstudent)
#[1] rstudent.glm* rstudent.lm*
When you call rstudent(fit)
, the S3 method dispatching mechanism finds rstudent.lm
instead, because inherits(fit, "lm")
is TRUE
. Unfortunately, stats:::rstudent.lm
is not doing the correct computation for an "mlm" model.
stats:::rstudent.lm
#function (model, infl = lm.influence(model, do.coef = FALSE),
# res = infl$wt.res, ...)
#{
# res <- res/(infl$sigma * sqrt(1 - infl$hat))
# res[is.infinite(res)] <- NaN
# res
#}
lm.influence
does not give the correct sigma
for "mlm". The underlying C routine C_influence
only computes sigma
for an "lm". If you give lm.influence
an "mlm", only the result for the first response variable is returned.
## pass in "mlm"
.Call(stats:::C_influence, fit$qr, FALSE, residuals(fit), 10 * .Machine$double.eps)$sigma
# [1] 1.3130265 1.3216357 1.3105706 1.3171621 1.3638689 1.1374385 1.2668101
# [8] 1.3416338 1.3586428 0.9180828
## pass in "lm"
.Call(stats:::C_influence, fit1$qr, FALSE, residuals(fit1), 10 * .Machine$double.eps)$sigma
# [1] 1.3130265 1.3216357 1.3105706 1.3171621 1.3638689 1.1374385 1.2668101
# [8] 1.3416338 1.3586428 0.9180828
Obviously for an "mlm" the sigma
should be a matrix. Now given this incorrect sigma
, "recycling rule" is applied behind the "/"
in the following line of stats:::rstudent.lm
, because the res
(residuals) on its left is a matrix but the stuff on its right is a vector.
res <- res / (infl$sigma * sqrt(1 - infl$hat))
Effectively, computational result is only correct for the first response variable; all the rest of response variables would use the wrong sigma
.
R core team needs to patch a number of diagnostic functions
Note that almost all functions listed in doc page ?influence.measures
are buggy for "mlm". They should have issued an warning saying that "mlm" methods are not implemented yet.
lm.influnce
needs be patched at C-level. As soon as this is done, rstudent.lm
would work correctly on "mlm".
Ohter functions can be easily patched at R-level, for example, stats:::cooks.distance.lm
and stats:::rstandard
. At the moment (R 3.5.1) they are:
stats:::cooks.distance.lm
#function (model, infl = lm.influence(model, do.coef = FALSE),
# res = weighted.residuals(model),
# sd = sqrt(deviance(model)/df.residual(model)),
# hat = infl$hat, ...)
#{
# p <- model$rank
# res <- ((res/(sd * (1 - hat)))^2 * hat)/p
# res[is.infinite(res)] <- NaN
# res
#}
stats:::rstandard.lm
#function (model, infl = lm.influence(model, do.coef = FALSE),
# sd = sqrt(deviance(model)/df.residual(model)), type = c("sd.1",
# "predictive"), ...)
#{
# type <- match.arg(type)
# res <- infl$wt.res/switch(type, sd.1 = sd * sqrt(1 - infl$hat),
# predictive = 1 - infl$hat)
# res[is.infinite(res)] <- NaN
# res
#}
and they can be patched (by using outer
) with
patched_cooks.distance.lm <-
function (model, infl = lm.influence(model, do.coef = FALSE),
res = weighted.residuals(model),
sd = sqrt(deviance(model)/df.residual(model)),
hat = infl$hat, ...)
{
p <- model$rank
res <- ((res / c(outer(1 - hat, sd))) ^ 2 * hat) / p
res[is.infinite(res)] <- NaN
res
}
patched_rstandard.lm <-
function (model, infl = lm.influence(model, do.coef = FALSE),
sd = sqrt(deviance(model)/df.residual(model)), type = c("sd.1",
"predictive"), ...)
{
type <- match.arg(type)
res <- infl$wt.res/switch(type, sd.1 = c(outer(sqrt(1 - infl$hat), sd)),
predictive = 1 - infl$hat)
res[is.infinite(res)] <- NaN
res
}
A quick test:
oo <- cbind(cooks.distance(fit1), cooks.distance(fit2)) ## correct result
all.equal(cooks.distance(fit), oo)
#[1] "Mean relative difference: 0.8211302"
all.equal(patched_cooks.distance.lm(fit), oo)
#[1] TRUE
rr <- cbind(rstandard(fit1), rstandard(fit2)) ## correct result
all.equal(rstandard(fit), rr)
#[1] "Mean relative difference: 0.5290036"
all.equal(patched_rstandard.lm(fit), rr)
#[1] TRUE
Thank you @李哲源; see also https://www.r-project.org/bugs.html on how to report bugs ... where they are noticed better by the R core team. Also, there, we could give better credit for patches..
Als note that R's source code - notably it's development version - is always available via svn ("subversion") or also a webbrowser at https://svn.r-project.org/R/trunk/
Both, cooks.distance.lm()
's and rstandard.lm()
's source code is in https://svn.r-project.org/R/trunk/src/library/stats/R/lm.influence.R .... all for a next time. Your proposed small code changes via outer()
are good enough of course.
Thank you very much for the thorough analysis and your well proposed bug fix!