Here are all the variables I'm working with:
str(ad.train)
$ Date : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
$ Team : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Season : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Round : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
$ Score : int 137 82 84 96 110 99 122 124 49 111 ...
$ Margin : int 69 18 -56 46 19 5 50 69 -26 29 ...
$ WinLoss : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
$ Opposition : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
$ Venue : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
$ Disposals : int 406 360 304 370 359 362 365 345 324 351 ...
$ Kicks : int 252 215 170 225 221 218 224 230 205 215 ...
$ Marks : int 109 102 52 41 95 78 93 110 69 85 ...
$ Handballs : int 154 145 134 145 138 144 141 115 119 136 ...
$ Goals : int 19 11 12 13 16 15 19 19 6 17 ...
$ Behinds : int 19 14 9 16 11 6 7 9 12 6 ...
$ Hitouts : int 42 41 34 47 45 70 48 54 46 34 ...
$ Tackles : int 73 53 51 76 65 63 65 67 77 58 ...
$ Rebound50s : int 28 34 23 24 32 48 39 31 34 29 ...
$ Inside50s : int 73 49 49 56 61 45 47 50 49 48 ...
$ Clearances : int 39 33 38 52 37 43 43 48 37 52 ...
$ Clangers : int 47 38 44 62 49 46 32 24 31 41 ...
$ FreesFor : int 15 14 15 18 17 15 19 14 18 20 ...
$ ContendedPossessions: int 152 141 149 192 138 164 148 151 160 155 ...
$ ContestedMarks : int 10 16 11 3 12 12 17 14 15 11 ...
$ MarksInside50 : int 16 13 10 8 12 9 14 13 6 12 ...
$ OnePercenters : int 42 54 30 58 24 56 32 53 50 57 ...
$ Bounces : int 1 6 4 4 1 7 11 14 0 4 ...
$ GoalAssists : int 15 6 9 10 9 12 13 14 5 14 ...
Here's the glm I'm trying to fit:
ad.glm.all <- glm(WinLoss ~ factor(Team) + Season + Round + Score + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s + Clearances + Clangers + FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces + GoalAssists,
                  data = ad.train, family = binomial(logit))
I know it's a lot of variables (the plan is to reduce via forward variable selection). But even though it's a lot of variables, they're all either int or Factor, which as I understand things should just work with a glm. However, every time I try to fit this model I get:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
Which sort of looks to me as if R isn't treating my Factor variables as Factor variables for some reason?
Even something as simple as:
ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))
isn't working! (same error message)
Whereas this:
ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))
will work!
Anyone know what's going on here? Why can't I fit these Factor variables to my glm??
Thanks in advance!
-Troy
Introduction
What a "contrasts error" is has been well explained: you have a factor that only has one level (or less). But in reality this simple fact can be easily obscured because the data that are actually used for model fitting can be very different from what you've passed in. This happens when you have
NA
in your data, you've subsetted your data, a factor has unused levels, or you've transformed your variables and getNaN
somewhere. You are rarely in this ideal situation where a single-level factor can be spotted fromstr(your_data_frame)
directly. Many questions on StackOverflow regarding this error are not reproducible, thus suggestions by people may or may not work. Therefore, although there are by now 118 posts regarding this issue, users still can't find an adaptive solution so that this question is raised again and again. This answer is my attempt, to solve this matter "once for all", or at least to provide a reasonable guide.This answer has rich information, so let me first make a quick summary.
I defined 3 helper functions for you: `debug_contr_error`, `debug_contr_error2`, `NA_preproc`. I recommend you use them in the following way:

1. use `NA_preproc` to get more complete cases;
2. use `debug_contr_error2` for debugging.

Most of the answer shows you step by step how and why these functions are defined. There is probably no harm in skipping the development process, but don't skip the sections from "Reproducible case studies and Discussions" onward.
Revised answer
The original answer works perfectly for the OP, and has successfully helped some others. But it failed elsewhere for lack of adaptiveness. Look at the output of `str(ad.train)` in the question: the OP's variables are numeric or factors; there are no characters. The original answer was for this situation. If you have character variables, although they will be coerced to factors during `lm` and `glm` fitting, they won't be reported by the code, since they were not provided as factors and `is.factor` will miss them. In this expansion I will make the original answer more adaptive.

Let `dat` be your dataset passed to `lm` or `glm`. If you don't readily have such a data frame, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way, but it works:
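A minimal sketch (where `y`, `x1`, `x2` are hypothetical placeholders for your own scattered variables):

```r
## gather scattered variables into a single data frame;
## `y`, `x1`, `x2` are placeholders for your actual variables
dat <- data.frame(y, x1, x2)
```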
Step 0: explicit subsetting

If you've used the `subset` argument of `lm` or `glm`, start with an explicit subsetting:
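For example, along these lines (a sketch; `subset_vec` stands for whatever you passed to the `subset` argument):

```r
## step 0: apply the subsetting yourself, before any checking;
## this form assumes a logical `subset_vec`
dat <- subset(dat, subset = subset_vec)
## for an index vector, use row indexing instead:
## dat <- dat[subset_vec, ]
```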
Step 1: remove incomplete cases

You can skip this step if you've gone through step 0, since `subset` automatically removes incomplete cases:
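Otherwise, removing incomplete cases is a one-liner:

```r
## step 1: drop all rows containing NA / NaN
dat <- na.omit(dat)
```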
Step 2: mode checking and conversion

A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.
A logical variable is tricky. It can either be treated as a dummy variable (`1` for `TRUE`; `0` for `FALSE`), hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether `model.matrix` thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

Some people may wonder why "integer" is not included. Because an integer vector, like `1:4`, has a "numeric" mode (try `mode(1:4)`).

A data frame column may also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.
Our checking produces an error when a variable has "complex" or "raw" mode, or when a matrix variable does not have "numeric" mode; it then proceeds to convert "logical" and "character" variables to "numeric" variables of "factor" class.
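A minimal sketch of this check and conversion (the two `stop` conditions spell out the errors just described):

```r
## step 2: mode checking and conversion
var_mode <- sapply(dat, mode)
## "complex" and "raw" modes are not supported
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
## matrix variables (class "AsIs") must have "numeric" mode
var_class <- sapply(dat, function(x) class(x)[1])
if (any(var_mode[var_class == "AsIs"] != "numeric"))
  stop("matrix variables with 'AsIs' class must be 'numeric'")
## coerce "logical" and "character" variables to factors
ind1 <- which(var_mode %in% c("logical", "character"))
dat[ind1] <- lapply(dat[ind1], as.factor)
```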
Note that if a data frame column is already a factor variable, it will not be included in `ind1`, as a factor variable has "numeric" mode (try `mode(factor(letters[1:4]))`).

Step 3: drop unused factor levels
We won't have unused factor levels for factor variables converted in step 2, i.e., those indexed by `ind1`. However, factor variables that come with `dat` might have unused levels (often as a result of step 0 and step 1). We need to drop any possible unused levels from them.
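In code this is again short (a sketch):

```r
## step 3: drop unused levels from all factor variables
fctr <- which(sapply(dat, is.factor))
dat[fctr] <- lapply(dat[fctr], droplevels)
```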
Step 4: summarize factor variables

Now we are ready to see what and how many factor levels are actually used by `lm` or `glm`. To make your life easier, I've wrapped up those steps into a function `debug_contr_error`.
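The exact original implementation is not reproduced here; the following is a reconstruction sketch assembled from steps 0 to 4 above, matching the input / output description that follows:

```r
debug_contr_error <- function (dat, subset_vec = NULL) {
  ## step 0: explicit subsetting, if a `subset_vec` is given
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat))
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
      dat <- dat[subset_vec, , drop = FALSE]
    } else if (mode(subset_vec) == "numeric") {
      dat <- dat[as.integer(subset_vec), , drop = FALSE]
    } else {
      stop("`subset_vec` must be 'logical' or 'numeric'")
    }
  }
  ## step 1: remove incomplete cases
  dat <- na.omit(dat)
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2: mode checking and conversion
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, function(x) class(x)[1])
  if (any(var_mode[var_class == "AsIs"] != "numeric"))
    stop("matrix variables with 'AsIs' class must be 'numeric'")
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## step 3: drop unused factor levels
  fctr <- which(sapply(dat, is.factor))
  if (length(fctr) == 0L) {
    warning("no factor variables to summarize")
    return(invisible(NULL))
  }
  dat[fctr] <- lapply(dat[fctr], droplevels)
  ## step 4: summarize factor variables
  lev <- lapply(dat[fctr], levels)
  list(nlevels = lapply(lev, length), levels = unlist(lev))
}
```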
Input:

- `dat` is your data frame passed to `lm` or `glm` via the `data` argument;
- `subset_vec` is the index vector passed to `lm` or `glm` via the `subset` argument.

Output: a list with

- `nlevels` (a list) giving the number of factor levels for all factor variables;
- `levels` (a vector) giving levels for all factor variables.

The function produces a warning if there are no complete cases or no factor variables to summarize.
Here is a constructed tiny example:
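A sketch of such an example, with `f2` supplied as a character (output abridged):

```r
dat <- data.frame(y = 1:4, x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

lm(y ~ x + f1 + f2, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels

## row 4 is dropped for its NA in `x`, leaving f2 = "A" "A" "A"
debug_contr_error(dat)$nlevels
#$f1
#[1] 2
#
#$f2
#[1] 1
```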
Good, we see an error, and my `debug_contr_error` exposes that `f2` ends up with a single level. Note that the original short answer is hopeless here, as `f2` is provided as a character variable, not a factor variable.

Now let's see an example with a matrix variable `x`:
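A sketch (the matrix column enters via `I()`; output abridged):

```r
dat <- data.frame(y = 1:4,
                  x = I(rbind(matrix(1:6, 3), NA)),
                  f = c("a", "a", "a", "b"))

lm(y ~ x + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels

## the NA row of the matrix `x` knocks out the only "b" row of `f`
debug_contr_error(dat)$nlevels
#$f
#[1] 1
```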
Note that a factor variable with no levels can cause a "contrasts error", too. You may wonder how a 0-level factor is possible. Well, it is legitimate: `nlevels(factor(character(0)))` is `0`. You will end up with 0-level factors if you have no complete cases:
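A sketch where every row contains an `NA`, so there are no complete cases:

```r
dat <- data.frame(y = 1:4, x = rep(NA_real_, 4), f = letters[1:4])

lm(y ~ x + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels  ## also warns: no complete cases
#$f
#[1] 0
```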
Finally, let's see a situation where `f2` is a logical variable. Our debugger will predict a "contrasts error", but will it really happen?
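A sketch, reusing the tiny example but with a logical `f2` (output abridged):

```r
dat <- data.frame(y = 1:4, x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c(TRUE, TRUE, TRUE, FALSE))

## the debugger coerces `f2` to a factor and sees a single level
debug_contr_error(dat)$nlevels
#$f1
#[1] 2
#
#$f2
#[1] 1

## yet the model fits without a "contrasts error"
lm(y ~ x + f1 + f2, data = dat)
#Coefficients:
#(Intercept)            x          f1b       f2TRUE
#          0            1            0           NA
```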
No, at least this one does not fail (the `NA` coefficient is due to the rank-deficiency of the model; don't worry).

It is difficult for me to come up with an example giving an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error, and in that case the debugger can locate the offending factor variable.

Perhaps some may argue that a logical variable is no different from a dummy. But try the simple example below: it really does depend on your formula.
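A sketch comparing a logical `u` with its numeric dummy `v`:

```r
u <- c(TRUE, TRUE, FALSE, FALSE)
v <- c(1, 1, 0, 0)  ## the numeric dummy of `u`

## with an intercept, `u` behaves just like the dummy `v`
model.matrix(~ u)
#  (Intercept) uTRUE
#1           1     1
#2           1     1
#3           1     0
#4           1     0

## without an intercept, `u` expands like a two-level factor...
model.matrix(~ u - 1)
#  uFALSE uTRUE
#1      0     1
#2      0     1
#3      1     0
#4      1     0

## ...while the numeric dummy `v` stays a single column
model.matrix(~ v - 1)
#  v
#1 1
#2 1
#3 0
#4 0
```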
More flexible implementation using the "model.frame" method of `lm`
You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what `lm` and `glm` do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic this internal process. Remember, the data actually used for model fitting can be very different from what you've passed in.

Our steps are not completely consistent with this internal processing. For comparison, you can retrieve the result of the internal processing by using `method = "model.frame"` in `lm` and `glm`. Try this on the previously constructed tiny example `dat` where `f2` is a character variable:
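A sketch (output abridged):

```r
dat <- data.frame(y = 1:4, x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

mf <- lm(y ~ x + f1 + f2, data = dat, method = "model.frame")
str(mf)
#'data.frame':  3 obs. of  4 variables:
# $ y : int  1 2 3
# $ x : int  1 2 3
# $ f1: Factor w/ 2 levels "a","b": 1 1 2
# $ f2: chr  "A" "A" "A"
# - attr(*, "terms")= ...
```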
In practice, `model.frame` only performs step 0 and step 1. It also drops variables provided in your dataset but not in your model formula, so a model frame may have both fewer rows and fewer columns than what you feed `lm` and `glm`. Type coercion as done in our step 2 is done by the later `model.matrix`, where a "contrasts error" may be produced.

There are a few advantages to first getting this internal model frame and then passing it to `debug_contr_error` (so that it essentially only performs steps 2 to 4).

advantage 1: variables not used in your model formula are ignored
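For instance, reusing the tiny `dat` just above (a sketch):

```r
## `f2` is not in the formula, so it is ignored by the model frame
mf <- lm(y ~ x + f1, data = dat, method = "model.frame")
debug_contr_error(mf)$nlevels
#$f1
#[1] 2
```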
advantage 2: able to cope with transformed variables
It is valid to transform variables in the model formula, and `model.frame` will record the transformed ones instead of the original ones. Note that even if your original variable has no `NA`, the transformed one can have:
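For example, a `log` transform can create `NaN` (a sketch; output abridged):

```r
dat <- data.frame(y = 1:4, x = c(-1, 1, 2, 3), f = c("a", "b", "b", "b"))

## `log(-1)` is NaN, so row 1 is dropped internally and `f` loses "a"
lm(y ~ log(x) + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels

## on the original data the debugger misses the problem...
debug_contr_error(dat)$nlevels
#$f
#[1] 2

## ...but on the internal model frame it spots it
mf <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
debug_contr_error(mf)$nlevels
#$f
#[1] 1
```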
Given these benefits, I write another function wrapping up `model.frame` and `debug_contr_error`.
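Again a reconstruction sketch, matching the input / output description below:

```r
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## step 0: explicit subsetting, if requested
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat))
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
      dat <- dat[subset_vec, , drop = FALSE]
    } else if (mode(subset_vec) == "numeric") {
      dat <- dat[as.integer(subset_vec), , drop = FALSE]
    } else {
      stop("`subset_vec` must be 'logical' or 'numeric'")
    }
  }
  ## steps 0 and 1, done internally: get the model frame
  mf <- lm(form, data = dat, method = "model.frame")
  attr(mf, "terms") <- NULL
  ## steps 2 to 4: reuse `debug_contr_error` on the model frame
  c(list(mf = mf), debug_contr_error(mf))
}
```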
Input:

- `form` is your model formula;
- `dat` is the dataset passed to `lm` or `glm` via the `data` argument;
- `subset_vec` is the index vector passed to `lm` or `glm` via the `subset` argument.

Output: a list with

- `mf` (a data frame) giving the model frame (with the "terms" attribute dropped);
- `nlevels` (a list) giving the number of factor levels for all factor variables;
- `levels` (a vector) giving levels for all factor variables.

Try the previous `log` transform example:
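A sketch of the call and its output (abridged):

```r
debug_contr_error2(y ~ log(x) + f, dat)$nlevels
#$f
#[1] 1
```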
Try `subset_vec` as well:
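For instance (a sketch):

```r
## logical subsetting: keep rows 1, 3 and 4
debug_contr_error2(y ~ log(x) + f, dat,
                   subset_vec = c(TRUE, FALSE, TRUE, TRUE))$nlevels
#$f
#[1] 1

## the equivalent index subsetting
debug_contr_error2(y ~ log(x) + f, dat, subset_vec = c(1, 3, 4))$nlevels
#$f
#[1] 1
```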
Model fitting per group and NA as factor levels

If you are fitting models by group, you are more likely to get a "contrasts error". You need to:

1. split your data frame by the grouping variable (see `?split.data.frame`);
2. work through those data frames one by one, applying `debug_contr_error2` (the `lapply` function can be helpful for this loop).
Some also told me that they cannot use `na.omit` on their data, because it will end up with too few rows to do anything sensible. This can be relaxed. In practice it is the `NA_integer_` and `NA_real_` that have to be omitted, but `NA_character_` can be retained: just add `NA` as a factor level. To achieve this, you need to loop through the variables in your data frame:

- if a variable `x` is already a factor and `anyNA(x)` is `TRUE`, do `x <- addNA(x)`. The "and" is important. If `x` has no `NA`, `addNA(x)` will add an unused `<NA>` level.
- if a variable `x` is a character, do `x <- factor(x, exclude = NULL)` to coerce it to a factor. `exclude = NULL` will retain `<NA>` as a level.
- if `x` is "logical", "numeric", "raw" or "complex", nothing should be changed. `NA` is just `NA`.

An `<NA>` factor level will not be dropped by `droplevels` or `na.omit`, and it is valid for building a model matrix. Check the following examples:
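For instance (a sketch; output abridged):

```r
f <- addNA(factor(c("a", "b", NA)))
levels(f)
#[1] "a" "b" NA

## the <NA> level survives `droplevels`...
droplevels(f)
#[1] a    b    <NA>
#Levels: a b <NA>

## ...and `na.omit` keeps all rows
na.omit(data.frame(f))
#     f
#1    a
#2    b
#3 <NA>

## and it is valid for building a model matrix
model.matrix(~ f)
#  (Intercept) fb fNA
#1           1  0   0
#2           1  1   0
#3           1  0   1
```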
Once you add `NA` as a level in a factor / character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use `debug_contr_error2` to see what has happened.

For your convenience, I write a function for this `NA` preprocessing.
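A sketch of such a function, implementing the three bullet points above:

```r
NA_preproc <- function (dat) {
  for (j in seq_len(ncol(dat))) {
    x <- dat[[j]]
    ## a factor with NA values: add <NA> as a level
    if (is.factor(x) && anyNA(x)) dat[[j]] <- addNA(x)
    ## a character: coerce to factor, keeping <NA> as a level
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
  }
  dat
}
```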
Input: `dat` is your full dataset.

Output: a data frame, with characters coerced to factors and `NA` added as a factor level where appropriate, so that the data have more complete cases.
Reproducible case studies and Discussions
The following are specially selected reproducible case studies, as I answered them with the three helper functions created here.
There are also a few other good-quality threads solved by other StackOverflow users:
This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using `predict` for prediction. Such behavior is not with `predict.lm` or `predict.glm`, but with predict methods from some packages. Here are a few related threads on StackOverflow.

Also note that the philosophy of this answer is based on that of `lm` and `glm`. These two functions are a coding standard for many model fitting routines, but maybe not all model fitting routines behave similarly. For example, it does not look transparent to me whether my helper functions would actually be helpful for the following.

Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, the OP passed the name of their variables rather than their values to `lm`. Since a name is a single-value character, it is later coerced to a single-level factor and causes the error.
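A sketch of this mistake (with hypothetical data; note the quoted variable name in the formula):

```r
dat <- data.frame(y = rnorm(10), x = rnorm(10))

## wrong: "x" is a length-1 character constant,
## which is later coerced to a single-level factor
lm(y ~ "x", data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels

## right: pass the variable itself
lm(y ~ x, data = dat)
```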
How to resolve this error after debugging?

In practice, people want to know how to resolve this matter, either at a statistical level or a programming level.

If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. Thus you may simply turn to a coding solution and drop the offending variables. `debug_contr_error2` returns `nlevels`, which helps you easily locate them. If you don't want to drop them, replace them with a vector of 1s (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let `lm` or `glm` deal with the resulting rank-deficiency.

If you are fitting models on a subset, there can be statistical solutions.
Fitting models by group does not necessarily require splitting your dataset by group and fitting independent models. The following may give you a rough idea:
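For example, group-specific effects can enter one joint model through interactions (a sketch; `g` is a hypothetical grouping factor in `dat`):

```r
## per-group intercepts and slopes via interaction,
## instead of splitting `dat` by `g` and fitting separate models
fit <- lm(y ~ g * x, data = dat)
```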
If you do split your data explicitly, you can easily get a "contrasts error" and thus have to adjust your model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for such a group.
You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this, and you'd better do a stratified sampling to ensure the success of both model estimation on the training part and prediction on the testing part.