Imputation using mice with clustered data

Posted 2019-04-14 12:42

Question:

I am using the mice package to impute missing data. I'm new to imputation, so I have made some progress but have now hit a steep learning curve. To give a toy example:

library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)

As you can see, I imputed nhanes 10 times using mostly default settings and stored the result in df1. I am comfortable using this result in regression models, pooling results, etc. However, my real-life data are survey data from different countries. The amount of missing data differs by country, as do the values of specific variables such as age and education level. I would therefore like to impute the missing values while allowing for clustering by country. So I will create a grouping variable that has no missing values (of course, in this toy example it is uncorrelated with the other variables, but in my real data the correlations exist).

# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)

So how do I tell mice() that this variable is different from the others, i.e. that it is a level in a multilevel dataset?

Answer 1:

If you're thinking of clusters as in "mixed-effects" models, then you should use the methods that mice provides for clustered data. These methods are listed in the manual and are usually prefixed with 2l, as in 2l.something.

The variety of methods for clustered data is somewhat limited in mice, but I can recommend using 2l.pan for missing data in lower-level units and 2l.only.norm at the cluster level.
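To see which two-level methods the installed version of mice actually ships, one option is to list its exported imputation functions. This is a minimal sketch that assumes only that mice is installed and that its imputation functions follow the usual mice.impute.<method> naming; the variable names are made up for illustration.

```r
library(mice)

# Imputation functions in mice are exported as mice.impute.<method>,
# so the two-level methods can be found by their "2l" prefix.
two_level_funs <- ls("package:mice", pattern = "^mice\\.impute\\.2l")

# Strip the prefix to get the names accepted by the `method` argument.
two_level_methods <- sub("^mice\\.impute\\.", "", two_level_funs)
print(two_level_methods)
```

The exact list depends on the installed version, so it is worth checking before settling on a method.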

As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.

Below, I show an example for both strategies.

Preparation:

library(mice)
data(nhanes)

set.seed(123)
nhanes <- within(nhanes,{
  country <- factor(sample(LETTERS[1:10], size=nrow(nhanes), replace=TRUE))
  countryID <- as.numeric(country)
})

Case 1: Imputation using mixed-effects models

This section uses 2l.pan to impute the three variables with missing data (this method requires the pan package to be installed). Note that I use countryID as the cluster variable by specifying a -2 in the predictor matrix. To all other variables, I assign fixed effects only (1).

# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred1 <- imp0$predictorMatrix
meth1 <- imp0$method

# set imputation procedures
meth1[c("bmi","hyp","chl")] <- "2l.pan"

# set predictor Matrix (mixed-effects models with random intercept
# for countryID and fixed effects otherwise)
pred1[,"country"] <- 0     # don't use country factor
pred1[,"countryID"] <- -2  # use countryID as cluster variable
pred1["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred1["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred1["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp1 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred1, method=meth1)
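After a run like the one above, it can be worth confirming that no missing values are left in the imputed variables. The sketch below is only an illustration: remaining_na() is a hypothetical helper, not part of mice, and a default imputation of nhanes stands in for imp1 so that the snippet runs on its own.

```r
library(mice)

# Hypothetical helper: number of missing values left in `vars`,
# summed over all completed data sets of a mids object.
remaining_na <- function(imp, vars) {
  long <- complete(imp, action = "long")  # stacks the m completed data sets
  colSums(is.na(long[, vars, drop = FALSE]))
}

# Stand-in imputation with default methods; in practice pass imp1 instead.
imp_demo <- mice(nhanes, m = 2, maxit = 2, seed = 1, printFlag = FALSE)
na_left <- remaining_na(imp_demo, c("bmi", "hyp", "chl"))
print(na_left)
```

Calling plot(imp1) additionally draws trace plots of the imputed means and standard deviations per iteration, which helps in judging whether 20 iterations were enough.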

Case 2: Imputation using dummy indicators (DIs) for clusters

This section uses pmm for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.

# create dummy indicator variables
DIs <- with(nhanes, contrasts(country)[country,])
colnames(DIs) <- paste0("country",colnames(DIs))
nhanes <- cbind(nhanes,DIs)


# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred2 <- imp0$predictorMatrix
meth2 <- imp0$method

# set imputation procedures
meth2[c("bmi","hyp","chl")] <- "pmm"

# set predictor matrix (dummy indicators for the clusters
# and fixed effects otherwise)
pred2[,"country"] <- 0     # don't use country factor
pred2[,"countryID"] <- 0   # don't use countryID
pred2[,colnames(DIs)] <- 1 # use dummy indicators
pred2["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred2["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred2["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp2 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred2, method=meth2)
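The dummy indicators in this case were built by indexing the contrast matrix of the country factor. As a cross-check, the same columns can be obtained from model.matrix() in base R. This is a small sketch on a stand-alone factor (the sample size of 25 simply mirrors nhanes), and it assumes R's default treatment contrasts for unordered factors.

```r
set.seed(123)
country <- factor(sample(LETTERS[1:10], size = 25, replace = TRUE))

# Approach used above: one row of the contrast matrix per observation.
DIs <- contrasts(country)[country, ]

# Alternative: treatment-coded design matrix without the intercept column.
DIs_mm <- model.matrix(~ country)[, -1]

# Both are 0/1 matrices with one column per non-reference level;
# only the row and column names differ.
same <- isTRUE(all.equal(unname(DIs), unname(DIs_mm)))
print(same)
```

Either way, the first level of the factor serves as the reference category and gets no column of its own.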

If you want to read up on what to think of these methods, have a look at one or two of these papers.



Answer 2:

You have to set up a predictorMatrix to tell mice which variables to use when imputing each other variable. A quick way to do so is predictorM <- quickpred(nhanes).

Then change the 1s in the matrix to 2 for ordinary predictors (in mice's two-level methods, 1 marks a fixed effect only, while 2 adds a random effect for that predictor) and to -2 for the level-two variable that identifies the countries, and pass the matrix to mice() as predictorMatrix = predictorM. In the method argument, set the method to 2l.norm for a metric variable or 2l.binom for a binary variable. For the latter you need the function written by Sabine Zinn (https://www.neps-data.de/Portals/0/Working%20Papers/WP_XXXI.pdf). Unfortunately, I do not know whether there are methods for imputing two-level count data.
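The steps just described could look roughly like the sketch below. It makes several assumptions: the integer cluster variable countryID with three balanced clusters is made up for the toy data, only the metric variables bmi and chl are imputed with 2l.norm (hyp keeps its default method, since 2l.binom is not part of mice), and predictors are kept as fixed effects (1) rather than 2 because the data set is tiny.

```r
library(mice)
data(nhanes)

dat <- nhanes
# Made-up cluster variable: three balanced clusters, coded as integers.
dat$countryID <- rep(1:3, length.out = nrow(dat))

# quickpred() picks predictors by correlation and returns a 0/1 matrix.
predictorM <- quickpred(dat)

# Variables to impute with a two-level model.
targets <- c("bmi", "chl")
predictorM[targets, "countryID"] <- -2  # cluster (class) variable
predictorM[targets, "age"] <- 1         # fixed effect; 2 would add a random slope

# Start from the default methods and switch the targets to 2l.norm.
meth <- mice(dat, maxit = 0)$method
meth[targets] <- "2l.norm"

imp <- mice(dat, m = 5, maxit = 5, predictorMatrix = predictorM,
            method = meth, seed = 123, printFlag = FALSE)
```

With real survey data, the cluster variable would be the country identifier converted to integers, for example with as.integer(factor(country)).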

Be aware that imputing a multilevel dataset slows the process down a lot. In my experience, resampling methods such as PMM, or those in the BaBooN package, work well at preserving the hierarchical structure of the data and are much faster to use.