So I am using the mice package to impute missing data. I'm new to imputation, so I've made some progress but have run into a steep learning curve. To give a toy example:
library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)
As you can see, I imputed the nhanes data 10 times (stored in df1) using mostly default settings, and I am comfortable using this result in regression models, pooling results, etc. However, my real-life data are survey data from different countries. The amount of missing data differs by country, as do the values of specific variables (e.g. age, education level). Therefore I would like to impute the missing values while allowing for clustering by country. So I will create a grouping variable which has no missing values (of course, in this toy example it is not correlated with the other variables, but in my real data those correlations exist).
# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)
So how do I tell mice() that this variable is different from the others, i.e. that it is a level in a multi-level dataset?
If you're thinking of clusters as in "mixed-effects" models, then you should use the methods provided by mice that are intended for clustered data. These methods can be found in the manual and are usually prefixed like 2l.something.
The variety of methods for clustered data is somewhat limited in mice, but I can recommend using 2l.pan for missing data in lower-level units and 2l.only.norm at the cluster level.
As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.
Below, I show an example for both strategies.
Preparation:
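A minimal sketch of what such a preparation step could look like, assuming the nhanes data that ships with mice. The column names country and clusterID and the choice of five clusters are illustrative assumptions, not taken from the original post. The integer-coded version is the one intended for the 2l.* methods, and the factor version is convenient for dummy indicators.

```r
library(mice)

# Work on a copy of the nhanes data that ships with mice
dat <- nhanes

# Hypothetical cluster structure: five "countries" with five rows each
# (nhanes has 25 rows); in real data this would be the observed country
dat$country <- factor(rep(LETTERS[1:5], each = 5))

# Integer-coded version of the same grouping, for the 2l.* methods
dat$clusterID <- as.integer(dat$country)

# Count missing values per column; the grouping variables should show 0
colSums(is.na(dat))
```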
Case 1: Imputation using mixed-effects models
This section uses 2l.pan to impute the three variables with missing data. Note that I use clusterID as the cluster variable by specifying a -2 in the predictor matrix. To all other variables, I assign fixed effects only (1).
Case 2: Imputation using dummy indicators (DIs) for clusters
This section uses pmm for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.
If you want to read up on what to think of these methods, have a look at one or two of these papers.
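The two cases described above could be sketched roughly as follows. This is a sketch under assumptions rather than the original code: the cluster variable (clusterID, with a factor twin country) is made up for illustration, m and maxit are arbitrary, and 2l.pan additionally needs the pan package to be installed, so that call is guarded.

```r
library(mice)

# Toy data: nhanes plus a hypothetical, fully observed cluster variable
dat <- nhanes
dat$country   <- factor(rep(LETTERS[1:5], each = 5))  # factor, for dummies
dat$clusterID <- as.integer(dat$country)              # integer, for 2l.*
targets <- c("bmi", "hyp", "chl")                     # incomplete variables

## Case 1: mixed-effects imputation with 2l.pan
dat1  <- dat[, c("age", "bmi", "hyp", "chl", "clusterID")]
meth1 <- make.method(dat1)            # "" for complete variables
pred1 <- make.predictorMatrix(dat1)   # all 1s except the diagonal
meth1[targets] <- "2l.pan"
pred1[targets, "clusterID"] <- -2     # -2 marks the cluster variable
# The remaining 1s mean "fixed effect only" (random-intercept model)

if (requireNamespace("pan", quietly = TRUE)) {
  imp1 <- mice(dat1, m = 5, maxit = 10, method = meth1,
               predictorMatrix = pred1, printFlag = FALSE, seed = 123)
}

## Case 2: pmm with dummy indicators for the clusters
dat2  <- dat[, c("age", "bmi", "hyp", "chl", "country")]
meth2 <- make.method(dat2)            # pmm for the numeric targets
pred2 <- make.predictorMatrix(dat2)   # the factor is dummy-coded by mice

imp2 <- mice(dat2, m = 5, maxit = 10, method = meth2,
             predictorMatrix = pred2, printFlag = FALSE, seed = 123)
```

Keeping the integer cluster ID and the factor in separate data frames avoids having two perfectly redundant copies of the grouping in the same imputation model.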
You have to set up a predictorMatrix to tell mice which variables to use to impute another. A fast way of doing so is to use
predictorM<-quickpred(nhanes)
Then you change the 1s in the matrix to 2 if it is a normal variable and to -2 if it is the level-two variable for the different countries (in mice, a 2 gives the predictor a random effect in addition to its fixed effect, whereas a 1 keeps it as a fixed effect only), and submit it to the mice command as predictorMatrix = predictorM. In the method argument you now have to set the methods to 2l.norm if it is a metric variable or 2l.binom if it is a binary variable. For the latter you need the function written by Sabine Zinn (https://www.neps-data.de/Portals/0/Working%20Papers/WP_XXXI.pdf). Unfortunately, I do not know whether there are methods for imputing two-level count data.
Be aware that imputing a multilevel dataset will slow down the process a lot. In my experience, resampling methods like PMM or those in the BaBooN package work well in keeping the hierarchical structure of the data and are much faster to use.
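A rough sketch of the quickpred() route described above, again on nhanes with a made-up cluster variable (clusterID is an assumption for illustration). It uses the include argument of quickpred() so that the cluster variable is always kept as a predictor, then recodes that column to -2. Note that 2l.pan is swapped in for 2l.norm here, simply to stay with the method recommended earlier in the thread; it needs the pan package, so the mice() call is guarded.

```r
library(mice)

dat <- nhanes
dat$clusterID <- rep(1:5, each = 5)   # hypothetical complete cluster variable

# Pick predictors by correlation, always keeping the cluster variable
predictorM <- quickpred(dat, include = "clusterID")

# Rows of incomplete variables: recode the cluster column from 1 to -2
targets <- names(which(colSums(is.na(dat)) > 0))
predictorM[targets, "clusterID"] <- -2
# Changing a remaining 1 to 2 would add a random slope for that predictor

# Method vector: multilevel method for incomplete variables, "" otherwise
meth <- make.method(dat)
meth[targets] <- "2l.pan"

if (requireNamespace("pan", quietly = TRUE)) {
  imp <- mice(dat, m = 5, maxit = 10, method = meth,
              predictorMatrix = predictorM, printFlag = FALSE, seed = 1)
}
```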