Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y
.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C
but none of them is present in the training partition.
The function createDataPartition
function from the package caret
does stratified sampling for the outcome variables and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
There are two solutions which spring to mind:
- First, create a subset of the data by selecting one random sample of each
factor level
first, starting from the rarest class (by frequency) and then greedily satisfying the next rare class and so on. Then usingcreateDataPartition
on the rest of the dataset and merging the results to create a new train dataset which contains alllevels
. - Using
createDataPartitions
and and doing rejection sampling.
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand rolled out one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackoverflow.com.
Update:
This is what my hand rolled out solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
if (is.na(factor.cols)) {
all.cols <- colnames(train.data)
factor.cols <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
}
n <- nrow(train.data)
test.n <- floor(1 / folds * n)
cond.met <- FALSE
n.tries <- 0
while (!cond.met) {
n.tries <- n.tries + 1
test.idx <- sample(nrow(train.data), test.n)
train.idx <- setdiff(1:nrow(train.data), test.idx)
cond.met <- TRUE
for(factor.col in factor.cols) {
train.levels <- train.data[ train.idx, factor.col ]
test.levels <- train.data[ test.idx , factor.col ]
if (length(unique(train.levels)) < length(unique(test.levels))) {
cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
cond.met <- FALSE
}
}
}
cat('Done in ', n.tries, ' trie(s).\n')
list( train.idx = train.idx
, test.idx = test.idx
)
}
When I call traceback I get this:
And looking at the
cv.glm
function gives:It seems the problem has to do with your extremely small sample size and categorical effect (with values "A", "B", and "C"). You are fitting a glm with 2 effects: "B:A" and "C:A". In each CV iteration you bootstrap from the sample dataset and fit a new model
d.glm
. Given the size, the bootstrapped data are guaranteed to come up with 1 or more iteration in which the value "C" is not sampled, hence the error comes from calculating fitted probabilities from the bootstrap model from the training data in which validation data has a "C" level for x not observed in the training data.Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modelling Strategies that one ought to favor against split sample validation when sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.
Given the small sample size here, you should consider some split sample cross validation alternatives like a permutation test, or a parametric bootstrap. Another important consideration is exactly why you feel model based inference isn't correct. As Tukey said of the bootstrap, he'd like to call it a shotgun. It will blow the head off of any problem, as long as you're willing to reassemble the pieces.
Everyone agrees that there sure is an optimal solution. But personally, I would just
try
thecv.glm
call until it works usingwhile
.I've tried it with 100,000 rows in the data.fame and it only takes a few seconds.