rstudent() returns incorrect result for an “mlm” (

2019-07-18 06:53发布

问题:

I know that the support for linear models with multiple LHS is limited. But when it is possible to run a function on an "mlm" object, I would expect the results to be trusty. When using rstudent, strange results are produced. Is this a bug or is there some other explanation?

In the example below fittedA and fittedB are identical, but in the case of rstudent the 2nd column differs.

y <- matrix(rnorm(20), 10, 2)
x <- 1:10
fittedA <- fitted(lm(y ~ x))
fittedB <- cbind(fitted(lm(y[, 1] ~ x)), fitted(lm(y[, 2] ~ x)))
rstudentA <- rstudent(lm(y ~ x))
rstudentB <- cbind(rstudent(lm(y[, 1] ~ x)), rstudent(lm(y[, 2] ~ x)))

回答1:

Setup

set.seed(0)
y <- matrix(rnorm(20), 10, 2)
x <- 1:10
fit <- lm(y ~ x)           ## class: "mlm", "lm"
fit1 <- lm(y[, 1] ~ x)     ## class: "lm"
fit2 <- lm(y[, 2] ~ x)     ## class: "lm"

rstudent(fit)
#          [,1]        [,2]
#1   0.74417620  0.89121744
#2  -0.67506054 -0.50275275
#3   0.76297805 -0.74363941
#4   0.71164461  0.01075898
#5   0.03337192  0.03355209
#6  -1.75099724 -0.02701558
#7  -1.05594284  0.56993056
#8  -0.48486883 -0.35286612
#9  -0.23468552  0.79610101
#10  2.90701182 -0.93665406

cbind(rstudent(fit1), rstudent(fit2))
#          [,1]        [,2]
#1   0.74417620  1.90280959
#2  -0.67506054 -0.92973971
#3   0.76297805 -1.47237918
#4   0.71164461  0.01870820
#5   0.03337192  0.06042497
#6  -1.75099724 -0.04056992
#7  -1.05594284  1.02171222
#8  -0.48486883 -0.64316472
#9  -0.23468552  1.69605079
#10  2.90701182 -1.25676088

As you've observed, only the result for the first response is correctly returned by rstandard(fit).


Why rstudent fails on "mlm"

The thing is, there is no "mlm" method for rstudent.

methods(rstudent)
#[1] rstudent.glm* rstudent.lm*

When you call rstudent(fit), the S3 method dispatching mechanism finds rstudent.lm instead, because inherits(fit, "lm") is TRUE. Unfortunately, stats:::rstudent.lm is not doing the correct computation for an "mlm" model.

stats:::rstudent.lm
#function (model, infl = lm.influence(model, do.coef = FALSE), 
#    res = infl$wt.res, ...) 
#{
#    res <- res/(infl$sigma * sqrt(1 - infl$hat))
#    res[is.infinite(res)] <- NaN
#    res
#}

lm.influence does not give the correct sigma for "mlm". The underlying C routine C_influence only computes sigma for an "lm". If you give lm.influence an "mlm", only the result for the first response variable is returned.

## pass in "mlm"
.Call(stats:::C_influence, fit$qr, FALSE, residuals(fit), 10 * .Machine$double.eps)$sigma
# [1] 1.3130265 1.3216357 1.3105706 1.3171621 1.3638689 1.1374385 1.2668101
# [8] 1.3416338 1.3586428 0.9180828

## pass in "lm"
.Call(stats:::C_influence, fit1$qr, FALSE, residuals(fit1), 10 * .Machine$double.eps)$sigma
# [1] 1.3130265 1.3216357 1.3105706 1.3171621 1.3638689 1.1374385 1.2668101
# [8] 1.3416338 1.3586428 0.9180828

Obviously for an "mlm" the sigma should be a matrix. Now given this incorrect sigma, "recycling rule" is applied behind the "/" in the following line of stats:::rstudent.lm, because the res (residuals) on its left is a matrix but the stuff on its right is a vector.

res <- res / (infl$sigma * sqrt(1 - infl$hat))

Effectively, computational result is only correct for the first response variable; all the rest of response variables would use the wrong sigma.


R core team needs to patch a number of diagnostic functions

Note that almost all functions listed in doc page ?influence.measures are buggy for "mlm". They should have issued an warning saying that "mlm" methods are not implemented yet.

lm.influnce needs be patched at C-level. As soon as this is done, rstudent.lm would work correctly on "mlm".

Ohter functions can be easily patched at R-level, for example, stats:::cooks.distance.lm and stats:::rstandard. At the moment (R 3.5.1) they are:

stats:::cooks.distance.lm
#function (model, infl = lm.influence(model, do.coef = FALSE), 
#    res = weighted.residuals(model),
#    sd = sqrt(deviance(model)/df.residual(model)), 
#    hat = infl$hat, ...) 
#{
#    p <- model$rank
#    res <- ((res/(sd * (1 - hat)))^2 * hat)/p
#    res[is.infinite(res)] <- NaN
#    res
#}

stats:::rstandard.lm
#function (model, infl = lm.influence(model, do.coef = FALSE), 
#    sd = sqrt(deviance(model)/df.residual(model)), type = c("sd.1", 
#        "predictive"), ...) 
#{
#    type <- match.arg(type)
#    res <- infl$wt.res/switch(type, sd.1 = sd * sqrt(1 - infl$hat), 
#        predictive = 1 - infl$hat)
#    res[is.infinite(res)] <- NaN
#    res
#}

and they can be patched (by using outer) with

patched_cooks.distance.lm <-
function (model, infl = lm.influence(model, do.coef = FALSE), 
    res = weighted.residuals(model),
    sd = sqrt(deviance(model)/df.residual(model)), 
    hat = infl$hat, ...) 
{
    p <- model$rank
    res <- ((res / c(outer(1 - hat, sd))) ^ 2 * hat) / p
    res[is.infinite(res)] <- NaN
    res
}

patched_rstandard.lm <- 
function (model, infl = lm.influence(model, do.coef = FALSE), 
    sd = sqrt(deviance(model)/df.residual(model)), type = c("sd.1", 
        "predictive"), ...) 
{
    type <- match.arg(type)
    res <- infl$wt.res/switch(type, sd.1 = c(outer(sqrt(1 - infl$hat), sd)), 
        predictive = 1 - infl$hat)
    res[is.infinite(res)] <- NaN
    res
}

A quick test:

oo <- cbind(cooks.distance(fit1), cooks.distance(fit2))  ## correct result
all.equal(cooks.distance(fit), oo)
#[1] "Mean relative difference: 0.8211302"
all.equal(patched_cooks.distance.lm(fit), oo)
#[1] TRUE

rr <- cbind(rstandard(fit1), rstandard(fit2))  ## correct result
all.equal(rstandard(fit), rr)
#[1] "Mean relative difference: 0.5290036"
all.equal(patched_rstandard.lm(fit), rr)
#[1] TRUE


回答2:

Thank you @李哲源; see also https://www.r-project.org/bugs.html on how to report bugs ... where they are noticed better by the R core team. Also, there, we could give better credit for patches..
Als note that R's source code - notably it's development version - is always available via svn ("subversion") or also a webbrowser at https://svn.r-project.org/R/trunk/

Both, cooks.distance.lm()'s and rstandard.lm()'s source code is in https://svn.r-project.org/R/trunk/src/library/stats/R/lm.influence.R .... all for a next time. Your proposed small code changes via outer() are good enough of course.

Thank you very much for the thorough analysis and your well proposed bug fix!