I have a data set called data, which has 481 092 rows.
I split data into two equal halves:
- the first half (rows 1 : 240 546) is called train and was used for the glm();
- the second half (rows 240 547 : 481 092) is called test and should be used to validate the model.
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an Error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.
I have tried the approach from "predict.lm() with an unknown factor level in test data", but it somehow doesn't work for me, or maybe I just don't get how to implement it. I want to predict the dependent binary variable, but of course only with the existing coefficients. The linked answer suggests telling R that rows with new levels should be treated as NA.
How can I proceed?
Edit: Suggested approach by Z. Li
I ran into a problem in the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to estimate the effect of a new factor level in fixed-effects modelling, which includes linear models and generalized linear models.
glm (as well as lm) keeps a record of the factor levels that were present and used during model fitting; they can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation + state + age + deliverytime
Then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation + state + age + deliverytime
However, the standard predict routine cannot take a customized prediction formula. There are commonly two solutions:
- extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
- reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. As such, we can still use the standard predict for prediction.
Now, let's first pick a factor level used for model fitting:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to the manufacturerID column of your prediction data, and we are ready to predict the linear predictor, that is, on the link scale rather than the response scale.
In the end, we adjust this linear predictor by subtracting the coefficient estimate of the chosen factor level.
Finally, if you want predictions on the original scale, you apply the inverse of the link function.
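The second solution can be sketched end to end on toy data. Everything below is illustrative: the simulated train and test data frames, the helper make_data, and the names stand_in, is_new and test2 are assumptions, not part of the original post. This variant only overwrites rows whose level was not seen during fitting, and it uses a non-reference level as the stand-in so that a coefficient exists to subtract (under R's default treatment contrasts, the reference level xlevels[1] has no coefficient of its own).

```r
set.seed(42)

# Hypothetical helper: simulate a small data set with a binary response.
make_data <- function(n, ids) {
  d <- data.frame(price = runif(n, 10, 100),
                  manufacturerID = factor(rep(ids, length.out = n)))
  d$returnShipment <- rbinom(n, 1, plogis(-1 + 0.01 * d$price))
  d
}
train <- make_data(300, c("1", "2", "3"))
test  <- make_data(60,  c("1", "2", "3", "125"))  # "125" never occurs in train

# Bare variable names in the formula, data supplied via `data`.
testreg <- glm(returnShipment ~ price + manufacturerID,
               family = binomial(link = "logit"), data = train)

# The standard call fails because of the unseen level; keep the message.
err <- tryCatch(predict(testreg, newdata = test, type = "response"),
                error = function(e) conditionMessage(e))

# Step 1: levels known to the model, and a stand-in level with a coefficient.
xlevels  <- testreg$xlevels$manufacturerID
stand_in <- xlevels[2]

# Step 2: overwrite only the rows with unseen levels (assumed variant).
is_new <- !(as.character(test$manufacturerID) %in% xlevels)
test2  <- test
test2$manufacturerID <- factor(
  ifelse(is_new, stand_in, as.character(test$manufacturerID)),
  levels = xlevels)

# Step 3: predict on the link scale (not type = "response").
eta <- predict(testreg, newdata = test2, type = "link")

# Step 4: remove the stand-in level's contribution for the overwritten rows.
est <- coef(testreg)[paste0("manufacturerID", stand_in)]
eta[is_new] <- eta[is_new] - est

# Step 5: back to the probability scale via the inverse link.
pred <- testreg$family$linkinv(eta)
head(pred)
```

After the adjustment, rows with an unseen level are predicted from the intercept and the remaining terms only, which with treatment contrasts is the same as treating them like the reference level. Whether that is an acceptable stand-in depends on the application.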
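Update:
You reported running into various problems when trying the solution above. Here is why.
Your code:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. Writing train$returnShipment, etc., restricts the lookup of variables strictly to the data frame train, and you will have trouble in later prediction with other data sets, like test.
As a simple example of this drawback, suppose we simulate some toy data in a data frame foo and fit a GLM with a formula written in this style. Now we see that everything comes with a prefix foo$. During prediction, the variables are still looked up through foo rather than in newdata, and we get an error.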
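A toy reproduction of this drawback might look like the following sketch. The data frame foo, its columns y and a, and the object names toy, newdata and err are all made up for illustration; removing foo before predicting is an assumption about how the failure is triggered.

```r
set.seed(0)
foo <- data.frame(y = rnorm(50),
                  a = factor(sample(letters[1:4], 50, replace = TRUE)))

# Bad style: every term carries the data frame prefix.
toy <- glm(foo$y ~ foo$a, data = foo)

formula(toy)        # the stored formula keeps the foo$ prefix
names(coef(toy))    # so do the coefficient names
names(toy$xlevels)  # and the xlevels entry, hence toy$xlevels$a is NULL

# Prediction still looks for an object called `foo`,
# not for the columns of newdata.
newdata <- foo[1:2, ]
rm(foo)
err <- tryCatch(predict(toy, newdata = newdata),
                error = function(e) conditionMessage(e))
err
```

If foo had still been around, predict would have run, but on the variables inside foo rather than on newdata, which is arguably worse than an error.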
The good style is to let the function get its variables from the data argument, that is, to use bare variable names in the formula and pass the data frame via data. Then foo$ goes away.
This would explain two things:
- when you look up testreg$xlevels$manufacturerID, you get NULL
- the prediction error you posted complains about train$manufacturerID instead of test$manufacturerID
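For comparison, here is a sketch of the same kind of toy fit written with bare variable names. As before, foo, y, a, toy and newdata are illustrative names, not objects from the original post.

```r
set.seed(0)
foo <- data.frame(y = rnorm(50),
                  a = factor(sample(letters[1:4], 50, replace = TRUE)))

# Good style: bare names in the formula, data frame passed via `data`.
toy <- glm(y ~ a, data = foo)

names(coef(toy))  # no foo$ prefix any more
toy$xlevels$a     # the fitted levels are now found under the plain name

# Prediction only needs newdata; the original data frame can be gone.
newdata <- foo[1:2, ]
rm(foo)
pred <- predict(toy, newdata = newdata)
pred
```

With the model specified this way, testreg$xlevels$manufacturerID in the original problem would return the fitted levels instead of NULL.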
As you have divided your train and test samples based on row numbers, some factor levels of your variables are not equally represented in both samples.
You need to do stratified sampling to ensure that both the train and test samples contain all factor levels. Use stratified from the splitstackshape package.
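To illustrate the idea without depending on an extra package, here is a minimal base R sketch of a stratified 50/50 split. It is not the splitstackshape implementation; the toy data frame dat and the names idx_by_level and train_idx are assumptions made for this example. Note that a level with a single row can only end up in one of the two samples.

```r
set.seed(1)

# Toy data sorted by level, so a split by row number separates the levels.
dat <- data.frame(
  id = 1:200,
  manufacturerID = factor(rep(c("1", "2", "3", "125"),
                              times = c(90, 70, 34, 6)))
)

# Naive split by row number: some levels occur only in the second half.
naive_train <- dat[1:100, ]
naive_test  <- dat[101:200, ]
setdiff(unique(as.character(naive_test$manufacturerID)),
        unique(as.character(naive_train$manufacturerID)))

# Stratified split: draw half of the rows within each factor level.
idx_by_level <- split(seq_len(nrow(dat)), dat$manufacturerID)
train_idx <- unlist(lapply(idx_by_level, function(i) {
  # index into i explicitly; sample(i, ...) misbehaves when length(i) == 1
  i[sample.int(length(i), size = ceiling(length(i) / 2))]
}), use.names = FALSE)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Levels in test that are missing from train (should be none here).
new_levels <- setdiff(unique(as.character(test$manufacturerID)),
                      unique(as.character(train$manufacturerID)))
new_levels
```

With several factor predictors, the grouping would have to be done on their combination, which is where a dedicated function such as stratified becomes more convenient.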