GAM with “gp” smoother: predict at new locations

Published 2020-04-17 03:45

Question:

I am using the following geoadditive model

library(gamair)
library(mgcv)

data(mack)    
mack$log.net.area <- log(mack$net.area)

gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
                       s(I(b.depth^.5)) +
                       s(c.dist) +
                       s(temp.20m) +
                       offset(log.net.area),
                       data = mack, family = tw, method = "REML")

How can I use it to predict the value of egg.count at new locations (lon/lat) where I don't have covariate data, as in kriging?

For example say I want to predict egg.count at these new locations

    lon lat
1  -3.00  44
4  -2.75  44
7  -2.50  44
10 -2.25  44
13 -2.00  44
16 -1.75  44

but here I don't know the values of the covariates (b.depth, c.dist, temp.20m, log.net.area).

Answer 1:

predict still requires all variables used in your model to be present in newdata, but you can pass arbitrary values, like 0s, for the covariates you don't have, then use type = "terms" and terms = name_of_the_wanted_smooth_term to proceed. Use

sapply(gm2$smooth, "[[", "label")
#[1] "s(lon,lat)"        "s(I(b.depth^0.5))" "s(c.dist)"        
#[4] "s(temp.20m)"

to check what smooth terms are in your model.

## new spatial locations to predict
newdat <- read.table(text = "lon lat
                             1  -3.00  44
                             4  -2.75  44
                             7  -2.50  44
                             10 -2.25  44
                             13 -2.00  44
                             16 -1.75  44")

## "garbage" values, just to pass the variable-name check in `predict.gam`
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0

## prediction on the link scale
pred_link <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535 

## simplify to vector
pred_link <- attr(pred_link, "constant") + rowSums(pred_link)
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665

## prediction on the response scale
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
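If you also want uncertainty for the spatial smooth, predict.gam can return standard errors alongside the term-wise fit via se.fit = TRUE. A sketch, using the same newdat as above (the ±1.96 normal interval on the link scale is an approximation, and it ignores uncertainty in the intercept):

```r
## term-wise prediction with standard errors for the spatial smooth
pr <- predict(gm2, newdata = newdat, type = "terms",
              terms = "s(lon,lat)", se.fit = TRUE)

## approximate 95% interval on the link scale (intercept + smooth),
## then back-transform to the response scale
ilink <- gm2$family$linkinv
fit_link <- attr(pr$fit, "constant") + pr$fit[, "s(lon,lat)"]
lower <- ilink(fit_link - 1.96 * pr$se.fit[, "s(lon,lat)"])
upper <- ilink(fit_link + 1.96 * pr$se.fit[, "s(lon,lat)"])
```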

I don't normally use predict.gam if I want a prediction for a specific smooth term. The logic of predict.gam is to first do prediction for all terms, i.e., the same as setting type = "terms". Then

  • if type = "link", do a rowSums on all term-wise predictions plus an intercept (possibly with offset);
  • if type = "terms", and "terms" or "exclude" are unspecified, return the result as it is;
  • if type = "terms" and you have specified "terms" and / or "exclude", some post-processing is done to drop the terms you don't want and return only those you do.

So, predict.gam will always do computation for all terms, even if you just want a single term.
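As an aside, a reasonably recent mgcv also lets you go the other way around within predict.gam itself: instead of selecting the wanted smooth, zero out the unwanted ones via the exclude argument (this is an alternative I'm sketching, not what the answer above uses). With the dummy log.net.area = 0, the offset contributes nothing, so this should reproduce pred_link above, intercept included:

```r
## keep only the spatial smooth by zeroing the other smooths;
## `newdat` still needs all variable names present
pred_link_alt <- predict(gm2, newdat, type = "link",
                         exclude = c("s(I(b.depth^0.5))", "s(c.dist)",
                                     "s(temp.20m)"))
```

Note this does not avoid the inefficiency discussed above: all terms are still computed before the excluded ones are dropped.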

Knowing the inefficiency behind this, here is what I do instead:

sm <- gm2$smooth[[1]]  ## extract smooth construction info for `s(lon,lat)`
Xp <- PredictMat(sm, newdat)  ## predictor matrix
b <- gm2$coefficients[with(sm, first.para:last.para)]  ## coefficients for this term
pred_link <- c(Xp %*% b) + gm2$coef[[1]]  ## this term + intercept
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301

You see, we get the same result.
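The manual route also gives standard errors cheaply. Continuing from Xp and sm above, take the block of the Bayesian covariance matrix of the coefficients that belongs to s(lon,lat) (a sketch; the smooth's own uncertainty only, excluding the intercept):

```r
## block of the coefficient covariance matrix for `s(lon,lat)`
idx <- with(sm, first.para:last.para)
Vb  <- vcov(gm2)[idx, idx]

## pointwise standard errors of the smooth on the link scale
se_link <- sqrt(rowSums((Xp %*% Vb) * Xp))
```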


Won't the result depend on the values assigned to the covariates (here 0)?

Some garbage predictions are made at those garbage values, but predict.gam discards them in the end.

Thanks, you are right. I am not totally sure I understand, then, why there is the option to supply covariate values at new locations.

Code maintenance is, as far as I can tell, very difficult for a big package like mgcv. The code would need to change significantly to suit every user's needs. Obviously the predict.gam logic as described here is inefficient when people like you just want a prediction for a single smooth, and in theory, in that case, the variable-name check on newdata could skip the unwanted terms. But that would require a significant rewrite of predict.gam and could introduce many bugs. Furthermore, the change would have to be submitted to CRAN, and CRAN may not be happy to see something so drastic.

Simon once shared his feelings: many people tell me I should write mgcv this way or that way, but I simply can't. So have some sympathy for a package author / maintainer like him.


Thanks for the updated answer. However, I don't understand why the predictions don't depend on the values of the covariates at the new locations.

They will depend on them if you provide covariate values for b.depth, c.dist, temp.20m and log.net.area. But since you don't have them at the new locations, the prediction simply assumes these effects are 0.

OK, thanks, I see now! So would it be correct to say that, in the absence of covariate values at new locations, I am only predicting the response from the spatial autocorrelation of the residuals?

You are only predicting the spatial field / smooth. In the GAM approach the spatial field is modeled as part of the mean, not the variance-covariance structure (as in kriging), so I think your use of "residuals" is not correct here.

Yes, you are right. Just to understand what this code does: would it be correct to say that I am predicting how the response changes over space but not its actual values at the new locations (since for that I would need the values of the covariates at these locations)?

Correct. You can try predict.gam with and without terms = "s(lon,lat)" to help you digest the output, and see how the output changes when you vary the garbage values passed to the other covariates.

## a possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0

predict(gm2, newdat, type = "terms")
#   s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1  -1.9881967          -1.05514 0.4739174   -1.466549
#4  -1.9137971          -1.05514 0.4739174   -1.466549
#7  -1.6365945          -1.05514 0.4739174   -1.466549
#10 -1.1247837          -1.05514 0.4739174   -1.466549
#13 -0.7910023          -1.05514 0.4739174   -1.466549
#16 -0.7234683          -1.05514 0.4739174   -1.466549
#attr(,"constant")
#(Intercept) 
#   2.553535 

predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535 

## another possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 1

predict(gm2, newdat, type = "terms")
#   s(lon,lat) s(I(b.depth^0.5))  s(c.dist) s(temp.20m)
#1  -1.9881967        -0.9858522 -0.3749018   -1.269878
#4  -1.9137971        -0.9858522 -0.3749018   -1.269878
#7  -1.6365945        -0.9858522 -0.3749018   -1.269878
#10 -1.1247837        -0.9858522 -0.3749018   -1.269878
#13 -0.7910023        -0.9858522 -0.3749018   -1.269878
#16 -0.7234683        -0.9858522 -0.3749018   -1.269878
#attr(,"constant")
#(Intercept) 
#   2.553535 

predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
#   s(lon,lat)
#1  -1.9881967
#4  -1.9137971
#7  -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept) 
#   2.553535