how to debug “factor has new levels” error for lin

2019-01-29 06:17发布

问题:

I am trying to make and test a linear model as follows:

lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)

This results in the following error, stating that the Product_Category_1 column has values that exist in the test data frame but not the train data frame):

factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18

However, if I check these they definitely look to appear in both data frames:

> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92

Also showing the table for train and test show they have the same factors:

> table(train$Product_Category_1)

     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18 
110820  18818  15820   9265 118955  16159   2923  89511    312   4030  19113   3108   4407   1201   4991   7730    467   2430 
> table(test$Product_Category_1)

    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
27533  4681  4029  2301 29637  4005   745 22621    92  1002  4847   767  1033   299  1212  1967   100   645 
> 

回答1:

Table of Contents:

  • A simple example for walkthrough
  • Suggestion for users
  • Helpful information that we can get from the fitted model object
  • OK, I see what the problem is now, but how to make predict work?
  • Is there a better way to avoid such problem at all?

A simple example for walkthrough

Here is simple enough reproducible example to hint you what has happened.

train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#  factor f has new levels d

I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm and predict work.

If you just check table(train$f) and table(test$f) it is not useful as the problem is not caused by variable f but by NA in x. lm and glm drop incomplete cases, i.e., rows with at least one NA (see ?complete.cases) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA. If you check the documentation at ?lm you will see this function has an argument na.action which defaults to na.omit. You can also set it to na.exclude but na.pass which retains NA will cause FORTRAN error:

fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
#  NA/NaN/Inf in 'x'

Let's remove NA from the training dataset.

train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d

f now has an unused level "d". lm and glm will drop unused levels when building the model frame (and later the model matrix):

## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())

This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.

mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
#  (Intercept)          x fb fc fd
#1           1 0.90021178  0  0  0
#2           1 0.10188534  1  0  0
#3           1 0.05881954  0  1  0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"

This is undesired as it produces NA coefficient for dummy variable fd. By drop.unused.levels = TRUE as forced by lm and glm:

mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)    
#  (Intercept)          x fb fc
#1           1 0.90021178  0  0
#2           1 0.10188534  1  0
#3           1 0.05881954  0  1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"

The fd is gone, and

mf$f
#[1] a b c
#Levels: a b c

The now non-existing "d" level will cause the "new factor level" error in predict.


Suggestion for users

It is highly recommended that all users do the following manually when fitting models:

  • [No. 1] remove incomplete cases;
  • [No. 2] drop unused factor levels.

This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm and glm do under the hood, and makes their debugging life much easier.

Note, there should be another recommendation in the list:

  • [No. 0] do subsetting yourself

Users may occasionally use subset argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict later.

The above advice is particularly important when you write functions wrapping lm or glm. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm and glm to complain.


Helpful information that we can get from the fitted model object

lm and glm return an xlevels value in the fitted object. It contains the factor levels actually used for model fitting.

fit$xlevels
#$f
#[1] "a" "b" "c"

So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels should be the first thing to inspect.

If you want to use something like table to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.


OK, I see what the problem is now, but how to make predict work?

If you can not choose to work with a different set of train and test dataset (see the next section), you need to set those factor levels in the test but not in xlevels to NA. Then predict will just predict NA for such incomplete cases.


Is there a better way to avoid such problem at all?

People split data into train and test as they want to do cross-validation. The first step is to apply na.omit on your full dataset to get rid of NA noise. Then we could do a random partitioning on what is left, but this this naive way may end up with

  • some factor levels in test but not in train (oops, we get "new factor level" error when using predict);
  • some factor variables in train only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm and glm);

So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.

There is in fact another hazard, but not causing programming errors:

  • the model matrix for train is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict).

Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.