I've built a model using caret. When training completed, I got the following warning:
Warning message: In train.default(x, y, weights = w, ...) : At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
The names of the variables are:
str(train)
'data.frame': 7395 obs. of 30 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
$ alchemy_category_score : num 3737 2052 4801 3816 3179 ...
$ avglinksize : num 2.06 3.68 2.38 1.54 2.68 ...
$ commonlinkratio_1 : num 0.676 0.508 0.562 0.4 0.5 ...
$ commonlinkratio_2 : num 0.206 0.289 0.322 0.1 0.222 ...
$ commonlinkratio_3 : num 0.0471 0.2139 0.1202 0.0167 0.1235 ...
$ commonlinkratio_4 : num 0.0235 0.1444 0.0426 0 0.0432 ...
$ compression_ratio : num 0.444 0.469 0.525 0.481 0.446 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0908 0.0987 0.0724 0.0959 0.0249 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.246 0.203 0.226 0.266 0.229 ...
$ image_ratio : num 0.00388 0.08865 0.12054 0.03534 0.05047 ...
$ is_news : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
$ linkwordscore : num 24 40 55 24 14 12 21 5 17 14 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5424 4973 2240 2737 12032 ...
$ numberOfLinks : num 170 187 258 120 162 55 93 132 194 326 ...
$ numwords_in_url : num 8 9 11 5 10 3 3 4 7 4 ...
$ parametrizedLinkRatio : num 0.1529 0.1818 0.1667 0.0417 0.0988 ...
$ spelling_errors_ratio : num 0.0791 0.1254 0.0576 0.1009 0.0826 ...
$ label : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ isVideo : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ noOfMetaTags : num 10 12 6 10 13 2 6 6 9 5 ...
My code is the following:
library(caret)
ctrl <- trainControl(method = "CV",
                     number = 10,
                     classProbs = TRUE,
                     allowParallel = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(476)
rfFit <- train(formula,
               data = train,
               method = "rf",
               tuneGrid = expand.grid(.mtry = seq(4, 20, by = 2)),
               ntree = 1000,  # randomForest's argument is ntree, not ntrees
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl)
pred <- predict.train(rfFit, newdata = test, type = "prob")
I get the error:
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : undefined columns selected
The variable names on the test data set are:
str(test)
'data.frame': 3171 obs. of 29 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
$ alchemy_category_score : num 5307 4825 1 6708 5416 ...
$ avglinksize : num 2.56 3.77 2.27 2.52 1.85 ...
$ commonlinkratio_1 : num 0.39 0.462 0.496 0.706 0.471 ...
$ commonlinkratio_2 : num 0.257 0.205 0.385 0.346 0.161 ...
$ commonlinkratio_3 : num 0.0441 0.0513 0.1709 0.123 0.0323 ...
$ commonlinkratio_4 : num 0.0221 0 0.1709 0.0906 0 ...
$ compression_ratio : num 0.49 0.782 1.25 0.449 0.454 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0671 0.0429 0.0588 0.0581 0.093 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.23 0.366 0.162 0.147 0.244 ...
$ image_ratio : num 0.19944 0.08 10 0.00596 0.03571 ...
$ is_news : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
$ linkwordscore : num 15 62 42 41 34 35 15 22 41 7 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5643 382 2420 5559 2209 ...
$ numberOfLinks : num 136 39 117 309 155 266 55 145 110 1 ...
$ numwords_in_url : num 3 2 1 10 10 7 1 9 5 0 ...
$ parametrizedLinkRatio : num 0.2426 0.1282 0.5812 0.0388 0.0968 ...
$ spelling_errors_ratio : num 0.0806 0.1765 0.125 0.0631 0.0653 ...
$ isVideo : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ noOfMetaTags : num 3 6 5 9 16 22 6 9 7 0 ...
If I omit the type="prob" part, I get no error.
Any ideas?
Could it be the length of the variable name "alchemy_category", which gets the respective factor level appended inside the model (e.g. "alchemy_categoryarts_entertainment")?
I have read through the answers above while facing a similar problem. A formal solution is to do this on the train and test datasets. Make sure you include the response variable in the feature.names too.
This creates syntactically correct labels for all factors.
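A minimal sketch of that idea in base R, not necessarily the exact code this answer used. It assumes feature.names is a character vector of column names that includes the response ("label" here), and it uses tiny made-up stand-ins for the real train and test data frames. The helper fix_levels is hypothetical.

```r
# Hypothetical stand-ins for the real train/test data frames
train <- data.frame(
  is_news     = factor(c("0", "1", "1")),
  avglinksize = c(2.06, 3.68, 2.38),
  label       = factor(c("0", "1", "1"))
)
test <- data.frame(
  is_news     = factor(c("1", "0")),
  avglinksize = c(2.56, 3.77)
)

# Assumed: names of the columns to clean, including the response
feature.names <- c("is_news", "avglinksize", "label")

# Rewrite the levels of every listed factor column as valid R names
# (hypothetical helper; numeric columns are left untouched)
fix_levels <- function(df, cols) {
  for (f in intersect(cols, names(df))) {
    if (is.factor(df[[f]])) {
      levels(df[[f]]) <- make.names(levels(df[[f]]))
    }
  }
  df
}

train <- fix_levels(train, feature.names)
test  <- fix_levels(test, feature.names)

levels(train$label)
```

Applying the same function to both data frames keeps the level names consistent between training and prediction, provided both started with the same levels.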
As stated above, the class values must be factors and must be valid names. Another way to ensure this is,
As per the above example, refactoring the outcome variable will usually fix the problem. It's better to change it in the original dataset, before partitioning into training and test datasets:
levels <- sort(unique(data$outcome))  # sorted, so labels line up with the factor's level order
data$outcome <- factor(data$outcome, levels = levels, labels = make.names(levels))
As others pointed out earlier, this problem only occurs when classProbs=TRUE, which causes the train function to generate additional statistics related to the outcome class.
As @Sam Firke already pointed out in comments (but I overlooked it) levels TRUE/FALSE also don't work. So I converted them to yes/no.
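A small sketch of that conversion, assuming the outcome starts out as a logical vector (the vector below is made up). The last line checks that make.names leaves the new levels unchanged, which is the property the warning is about.

```r
# Hypothetical logical outcome
outcome <- c(TRUE, FALSE, TRUE, TRUE)

# Map TRUE/FALSE to "yes"/"no" and fix the level order explicitly
outcome_f <- factor(ifelse(outcome, "yes", "no"), levels = c("no", "yes"))

# TRUE only if every level is already a syntactically valid R name
all(make.names(levels(outcome_f)) == levels(outcome_f))
```

TRUE and FALSE are reserved words in R, so make.names alters them, whereas "yes" and "no" pass through untouched.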
The answer is in bold at the top of your post =]
What are you modeling? Is it alchemy_category? The code only says formula and we can't see it.

When you ask for class probabilities, model predictions are a data frame with separate columns for each class/level. If alchemy_category doesn't have levels that are valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name but the data frame has a different (but valid) name.

For example, if I had
the code would be looking for "level 2" but there is only "level.2".
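A small base-R illustration of that renaming, with made-up level names and probabilities. It only mimics the column lookup and does not call caret itself; the object names probs and obs_levels are hypothetical.

```r
# data.frame() repairs invalid column names by default (check.names = TRUE)
probs <- data.frame("level 1" = c(0.2, 0.7),
                    "level 2" = c(0.8, 0.3))
names(probs)

# The original class levels, which no longer match the column names
obs_levels <- c("level 1", "level 2")

# Selecting by the original names fails; capture the message instead of stopping
result <- tryCatch(
  probs[, obs_levels, drop = FALSE],
  error = function(e) conditionMessage(e)
)
result
```

Renaming the levels with make.names before training keeps the two sets of names identical, so the lookup succeeds.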