I have a serial data formatted as follows:
time milk Animal_ID
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
. . .
Generally, 300 animals have records of milk in different time points of short period. However, if we join their data together and do not care about different animal_ID, we would have a curve between milk~time like this, the line in figure below:
Also, in the above figure, we have data for 1 example animal, they are short and highly variable. My purposed is to smooth each animal data but it would be would if the model allows learning general patter from whole data to be included. I used different smooth model (ns, bs, smooth.spline) with the following format but it just did not work:
mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID)
I am hoping if somebody has already dealt with this problem would give me an advice. Thanks
The full dataset can be accessed from here:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0
I would suggest you use mgcv
package. This is one of the recommended R packages, performing a class of models called generalized additive mixed models. You can simply load it by library(mgcv)
. This is a very powerful library, which can handle from the simplest linear regression model, to generalized linear models, to additive models, to generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv
via
ls("package:mgcv")
And you can see there are many of them.
For your specific data and problem, you may use a model with formula:
model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
In mgcv
, s()
is a setup for smooth functions, represented by spline basis implied by bs
. "cr" is the cubic spline basis, which is exactly what you want. k
is the number of knots. It should be chosen depending on the number of unique values of variable time
in your data set. If you set k
to exactly this number, you end up with a smoothing spline; while any value smaller than that means a regression spline. However, both will be penalized (if you know what penalization mean). I read your data in:
dat <- na.omit(read.csv("data.txt", header = TRUE)) ## I saved you data into file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat) ## 12624 observations
length(unique(dat$time)) ## 157 unique time points
length(ID <- levels(dat$Animal_ID)) ## 355 cows
There are 157 unique values, so I reckon k = 100
is possibly appropriate.
For Animal_ID
(coerced as a factor), we need a model term for random effect. "re" is a special class for i.i.d random effect. It is passed to bs
for some internal matrix construction reason (so this is not a smooth function!).
Now to fit a GAM model, you can call the legacy gam
or the constantly developing bam
(gam for big data). I think you will use the latter. They have the same calling convention similar to lm
and glm
. For example, you can do:
fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
As you can see, bam
allows multi-core parallel computation via nthreads
. While discrete
is a newly developed feature which can speed up matrix formation.
Since you are dealing with time series data, finally you might consider some temporal autocorrelation. mgcv
allows configuration of AR1 correlation, whose correlation coefficient is passed by bam
argument rho
. However, you need an extra index AR_start
to tell mgcv
how the time series breaks up into pieces. For example, when reaching a different Animal_ID
, AR_start
get a TRUE
to indicate a new segment of time series. See ?bam
for details.
mgcv
also provides
summary.gam
function for model summary
gam.check
for basic model checking
plot.gam
function for plotting individual terms
predict.gam
(or predict.bam
) for prediction on new data.
For example, the summary of the above suggested model is:
> summary(fit)
Family: gaussian
Link function: identity
Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.1950 0.2704 96.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 10.81 13.67 5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.805 Deviance explained = 81.1%
fREML = 29643 Scale est. = 5.5681 n = 12624
The edf
(effective degree of freedom) may be thought of as a measure of the degree of non-linearity. So we put in k = 100
, while ending up with edf = 10.81
. This suggest that the spline s(time)
has been heavily penalized. You can view the what s(time)
looks like by:
plot.gam(fit, page = 1)
Note that the random effect s(Animal_ID)
also has a "smooth", that is an cow-specific constant. For random effects, a Gaussian QQ plot will be returned.
The diagnostic figures returned by
invisible(gam.check(fit))
looks OK, so I think the model is acceptable (I am not offering you model selection, so think up a better model if you think there is).
If you want to make prediction for Animal_ID = 26
, you may do
newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = `link`, se.fit = TRUE)
Note that
- You need to include both variables in
newd
(otherwise mgcv
complains missing variable)
- since you have only one spline smooth
s(time)
, and the random effect term s(Animal_ID)
is a constant per Animal_ID
. so it is OK to use type = 'link'
for individual prediction. By the way, type = 'terms'
is slower than type = 'link'
.
If you want to make prediction for more than one cows, try something like this:
pred.ID <- ID[1:10] ## predict first 10 cows
newd <- data.frame (time = rep (1:150, times = n), Animal_ID = factor (rep (pred.ID, each = 150)))
oo <- predict.bam (fit, newd, type = "link", se.fit = TRUE)
Note that
- I have used
predict.bam
here, as now we have 150 * 10 = 1500
data points to predict. Plus: we require se.fit = TRUE
. This is rather expensive, so use predict.bam
is faster than predict.gam
. Particularly, if you have fitted your model using bam(..., discrete = TRUE)
, you can have predict.bam(..., discrete = TRUE)
. Prediction process goes through the same matrix formation steps as in model fitting (see ?smoothCon
used in model fitting and ?PredictMat
used in prediction, if you are keen to know more internal structure of mgcv
.)
- I specified
Animal_ID
as factors, because this is a random effect.
For more on mgcv
, you can refer to library manual. Check specially ?mgcv
, ?gam
, ?bam
?s
.
Final update
Though I said that I will not help you with model section, but I think this model is better (it gives higher adj-Rsquared
) and is also more reasonable in sense:
model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
The last term is imposing a random slop. This implies that we are assuming that each individual cow has different growing/reducing pattern of milk production. This is a more sensible assumption in your problem. The earlier model with only random intercept is not sufficient. After adding this random slop, the smooth term s(time)
looks smoother. This is a good sign not a bad sign, because we want some simple explanation for s(time)
, don't we? Compare the s(time)
you get from both models, and see what you discover.
I have also reduced k = 100
to k = 20
. As we saw in previous fit, the edf
for this term is about 10, so k = 20
is pretty sufficient.