I am trying to make and test a linear model as follows:
lm_model <- lm(Purchase ~., data = train)
lm_prediction <- predict(lm_model, test)
This results in the following error, stating that the Product_Category_1
column has values that exist in the test
data frame but not the train
data frame):
factor Product_Category_1 has new levels 7, 9, 14, 16, 17, 18
However, if I check these they definitely look to appear in both data frames:
> nrow(subset(train, Product_Category_1 == "7"))
[1] 2923
> nrow(subset(test, Product_Category_1 == "7"))
[1] 745
> nrow(subset(train, Product_Category_1 == "9"))
[1] 312
> nrow(subset(test, Product_Category_1 == "9"))
[1] 92
Also showing the table for train
and test
show they have the same factors:
> table(train$Product_Category_1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
110820 18818 15820 9265 118955 16159 2923 89511 312 4030 19113 3108 4407 1201 4991 7730 467 2430
> table(test$Product_Category_1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
27533 4681 4029 2301 29637 4005 745 22621 92 1002 4847 767 1033 299 1212 1967 100 645
>
Table of Contents:
- A simple example for walkthrough
- Suggestion for users
- Helpful information that we can get from the fitted model object
- OK, I see what the problem is now, but how to make
predict
work?
- Is there a better way to avoid such problem at all?
A simple example for walkthrough
Here is simple enough reproducible example to hint you what has happened.
train <- data.frame(y = runif(4), x = c(runif(3), NA), f = factor(letters[1:4]))
test <- data.frame(y = runif(4), x = runif(4), f = factor(letters[1:4]))
fit <- lm(y ~ x + f, data = train)
predict(fit, newdata = test)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor f has new levels d
I am fitting a model with more parameters than data so the model is rank-deficient (to be explained in the end). However, this does not affect how lm
and predict
work.
If you just check table(train$f)
and table(test$f)
it is not useful as the problem is not caused by variable f
but by NA
in x
. lm
and glm
drop incomplete cases, i.e., rows with at least one NA
(see ?complete.cases
) for model fitting. They have to do so as otherwise the underlying FORTRAN routine for QR factorization would fail because it can not handle NA
. If you check the documentation at ?lm
you will see this function has an argument na.action
which defaults to na.omit
. You can also set it to na.exclude
but na.pass
which retains NA
will cause FORTRAN error:
fit <- lm(y ~ x + f, data = train, na.action = na.pass)
#Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'x'
Let's remove NA
from the training dataset.
train <- na.omit(train)
train$f
#[1] a b c
#Levels: a b c d
f
now has an unused level "d"
. lm
and glm
will drop unused levels when building the model frame (and later the model matrix):
## source code of lm; don't run
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
This is not user controllable. The reason is that if an unused level is included, it will generate a column of zeros in the model matrix.
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = FALSE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc fd
#1 1 0.90021178 0 0 0
#2 1 0.10188534 1 0 0
#3 1 0.05881954 0 1 0
#attr(,"assign")
#[1] 0 1 2 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
This is undesired as it produces NA
coefficient for dummy variable fd
. By drop.unused.levels = TRUE
as forced by lm
and glm
:
mf <- model.frame(y ~ x + f, data = train, drop.unused.levels = TRUE)
model.matrix(y ~ x + f, data = mf)
# (Intercept) x fb fc
#1 1 0.90021178 0 0
#2 1 0.10188534 1 0
#3 1 0.05881954 0 1
#attr(,"assign")
#[1] 0 1 2 2
#attr(,"contrasts")
#attr(,"contrasts")$f
#[1] "contr.treatment"
The fd
is gone, and
mf$f
#[1] a b c
#Levels: a b c
The now non-existing "d"
level will cause the "new factor level" error in predict
.
Suggestion for users
It is highly recommended that all users do the following manually when fitting models:
- [No. 1] remove incomplete cases;
- [No. 2] drop unused factor levels.
This is exactly the procedure as recommended here: How to debug "contrasts can be applied only to factors with 2 or more levels" error? This gets users aware of what lm
and glm
do under the hood, and makes their debugging life much easier.
Note, there should be another recommendation in the list:
- [No. 0] do subsetting yourself
Users may occasionally use subset
argument. But there is a potential pitfall: not all factor levels might appear in the subsetted dataset, thus you may get "new factor levels" when using predict
later.
The above advice is particularly important when you write functions wrapping lm
or glm
. You want your functions to be robust. Ask your function to return an informative error rather than waiting for lm
and glm
to complain.
Helpful information that we can get from the fitted model object
lm
and glm
return an xlevels
value in the fitted object. It contains the factor levels actually used for model fitting.
fit$xlevels
#$f
#[1] "a" "b" "c"
So in case you have not followed the recommendations listed above and have got into trouble with factor levels, this xlevels
should be the first thing to inspect.
If you want to use something like table
to count how many cases there are for each factor levels, here is a way: Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R], although making a model matrix can use much RAM.
OK, I see what the problem is now, but how to make predict
work?
If you can not choose to work with a different set of train
and test
dataset (see the next section), you need to set those factor levels in the test
but not in xlevels
to NA
. Then predict
will just predict NA
for such incomplete cases.
Is there a better way to avoid such problem at all?
People split data into train
and test
as they want to do cross-validation. The first step is to apply na.omit
on your full dataset to get rid of NA
noise. Then we could do a random partitioning on what is left, but this this naive way may end up with
- some factor levels in
test
but not in train
(oops, we get "new factor level" error when using predict
);
- some factor variables in
train
only have 1 level after unused levels removed (oops, we get "contrasts" error when using lm
and glm
);
So, it is highly recommended that you do some more sophisticated partitioning like stratified sampling.
There is in fact another hazard, but not causing programming errors:
- the model matrix for
train
is rank-deficient (oops, we get a "prediction for rank-deficient model may be misleading" warning when using predict
).
Regarding the rank-deficiency in model fitting, see lme4::lmer reports "fixed-effect model matrix is rank deficient", do I need a fix and how to? Rank-deficiency does not cause problem for model estimation and checking, but can be a hazard for prediction: R lm
, Could anyone give me an example of the misleading case on “prediction from a rank-deficient”? However, such issue is more difficult to avoid, particularly if you have many factors and possibly with interaction.