mgcv: how to extract knots, basis, coefficients an

2019-01-23 20:44发布

问题:

I'm using the mgcv package in R to fit some polynomial splines to some data via:

x.gam <- gam(cts ~ s(time, bs = "ad"), data = x.dd,
             family = poisson(link = "log"))

I'm trying to extract the functional form of the fit. x.gam is a gamObject, and I've been reading the documentation but haven't found enough information in order to manually reconstruct the fitted function.

  • x.gam$smooth contains information about whether the knots have been placed;
  • x.gam$coefficients gives the spline coefficients, but I don't know what order polynomial splines are used and looking in the code has not revealed anything.

Is there a neat way to extract the knots, coefficients and basis used so that one can manually reconstruct the fit?

回答1:

I don't have your data, so I take the following example from ?adaptive.smooth to show you where you can find information you want. Note that though this example is for Gaussian data rather than Poisson data, only the link function is different; all the rest are just standard.

 x <- 1:1000/1000  # data between [0, 1]
 mu <- exp(-400*(x-.6)^2)+5*exp(-500*(x-.75)^2)/3+2*exp(-500*(x-.9)^2)
 y <- mu+0.5*rnorm(1000)
 b <- gam(y~s(x,bs="ad",k=40,m=5))

Now, all information on smooth construction is stored in b$smooth, we take it out:

smooth <- b$smooth[[1]]  ## extract smooth object for first smooth term

knots:

smooth$knots gives you location of knots.

> smooth$knots
 [1] -0.081161 -0.054107 -0.027053  0.000001  0.027055  0.054109  0.081163
 [8]  0.108217  0.135271  0.162325  0.189379  0.216433  0.243487  0.270541
[15]  0.297595  0.324649  0.351703  0.378757  0.405811  0.432865  0.459919
[22]  0.486973  0.514027  0.541081  0.568135  0.595189  0.622243  0.649297
[29]  0.676351  0.703405  0.730459  0.757513  0.784567  0.811621  0.838675
[36]  0.865729  0.892783  0.919837  0.946891  0.973945  1.000999  1.028053
[43]  1.055107  1.082161

Note, three external knots are placed beyond each side of [0, 1] to construct spline basis.

basis class

attr(smooth, "class") tells you the type of spline. As you can read from ?adaptive.smooth, for bs = ad, mgcv use P-splines, hence you get "pspline.smooth".

mgcv use 2nd order pspline, you can verify this by checking the difference matrix smooth$D. Below is a snapshot:

> smooth$D[1:6,1:6]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1   -2    1    0    0    0
[2,]    0    1   -2    1    0    0
[3,]    0    0    1   -2    1    0
[4,]    0    0    0    1   -2    1
[5,]    0    0    0    0    1   -2
[6,]    0    0    0    0    0    1

coefficients

You have already known that b$coefficients contain model coefficients:

beta <- b$coefficients

Note this is a named vector:

> beta
(Intercept)      s(x).1      s(x).2      s(x).3      s(x).4      s(x).5 
 0.37792619 -0.33500685 -0.30943814 -0.30908847 -0.31141148 -0.31373448 
     s(x).6      s(x).7      s(x).8      s(x).9     s(x).10     s(x).11 
-0.31605749 -0.31838050 -0.32070350 -0.32302651 -0.32534952 -0.32767252 
    s(x).12     s(x).13     s(x).14     s(x).15     s(x).16     s(x).17 
-0.32999553 -0.33231853 -0.33464154 -0.33696455 -0.33928755 -0.34161055 
    s(x).18     s(x).19     s(x).20     s(x).21     s(x).22     s(x).23 
-0.34393354 -0.34625650 -0.34857906 -0.05057041  0.48319491  0.77251118 
    s(x).24     s(x).25     s(x).26     s(x).27     s(x).28     s(x).29 
 0.49825345  0.09540020 -0.18950763  0.16117012  1.10141701  1.31089436 
    s(x).30     s(x).31     s(x).32     s(x).33     s(x).34     s(x).35 
 0.62742937 -0.23435309 -0.19127140  0.79615752  1.85600016  1.55794576 
    s(x).36     s(x).37     s(x).38     s(x).39 
 0.40890236 -0.20731309 -0.47246357 -0.44855437

basis matrix / model matrix / linear predictor matrix (lpmatrix)

You can get model matrix from:

mat <- predict.gam(b, type = "lpmatrix")

This is an n-by-p matrix, where n is the number of observations, and p is the number of coefficients. This matrix has column name:

> head(mat[,1:5])
  (Intercept)    s(x).1    s(x).2      s(x).3      s(x).4
1           1 0.6465774 0.1490613 -0.03843899 -0.03844738
2           1 0.6437580 0.1715691 -0.03612433 -0.03619157
3           1 0.6384074 0.1949416 -0.03391686 -0.03414389
4           1 0.6306815 0.2190356 -0.03175713 -0.03229541
5           1 0.6207361 0.2437083 -0.02958570 -0.03063719
6           1 0.6087272 0.2688168 -0.02734314 -0.02916029

The first column is all 1, giving intercept. While s(x).1 suggests the first basis function for s(x). If you want to view what individual basis function look like, you can plot a column of mat against your variable. For example:

plot(x, mat[, "s(x).20"], type = "l", main = "20th basis")

linear predictor

If you want to manually construct the fit, you can do:

pred.linear <- mat %*% beta

Note that this is exactly what you can get from b$linear.predictors or

predict.gam(b, type = "link")

response / fitted values

For non-Gaussian data, if you want to get response variable, you can apply inverse link function to linear predictor to map back to original scale.

Family information are stored in gamObject$family, and gamObject$family$linkinv is the inverse link function. The above example will certain gives you identity link, but for your fitted object x.gam, you can do:

x.gam$family$linkinv(x.gam$linear.predictors)

Note this is the same to x.gam$fitted, or

predict.gam(x.gam, type = "response").

Other links

I have just realized that there were quite a lot of similar questions before.

  1. This answer by Gavin Simpson is great, for predict.gam( , type = 'lpmatrix').
  2. This answer is about predict.gam(, type = 'terms').

But anyway, the best reference is always ?predict.gam, which includes extensive examples.