I am trying to estimate a logistic regression, using the 10-fold cross-validation.
#import libraries
library(car); library(caret); library(e1071); library(verification)
#data import and preparation
data(Chile)
chile <- na.omit(Chile) #remove "na's"
chile <- chile[chile$vote == "Y" | chile$vote == "N" , ] #only "Y" and "N" required
chile$vote <- factor(chile$vote) #required to remove unwanted levels
chile$income <- factor(chile$income) # treat income as a factor
Goal is to estimate a glm - model that predicts to outcome of vote "Y" or "N" depended on relevant explanatory variables and, based on the final model, compute a confusion matrix and ROC curve to grasp the models behaviour for different threshold levels.
Model selection leads to:
res.chileIII <- glm(vote ~
sex +
education +
statusquo ,
family = binomial(),
data = chile)
#prediction
chile.pred <- predict.glm(res.chileIII, type = "response")
generates:
> head(chile.pred)
1 2 3 4 5 6
0.974317861 0.008376988 0.992720134 0.095014139 0.040348115 0.090947144
to compare the observed with estimation:
chile.v <- ifelse(chile$vote == "Y", 1, 0) #to compare the two arrays
chile.predt <- function(t) ifelse(chile.pred > t , 1,0) #t is the threshold for which the confusion matrix shall be computed
confusion matrix for t = 0.3:
confusionMatrix(chile.predt(0.3), chile.v)
> confusionMatrix(chile.predt(0.3), chile.v)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 773 44
1 94 792
Accuracy : 0.919
95% CI : (0.905, 0.9315)
No Information Rate : 0.5091
P-Value [Acc > NIR] : < 2.2e-16
and the Roc-curve:
roc.plot(chile.v, chile.pred)
which seems as a reasonable model.
Now instead of using the "normal" predict.glm() function I want to test out the performance difference to a 10-fold cross-validation estimation.
tc <- trainControl("cv", 10, savePredictions=T) #"cv" = cross-validation, 10-fold
fit <- train(chile$vote ~ chile$sex +
chile$education +
chile$statusquo ,
data = chile ,
method = "glm" ,
family = binomial ,
trControl = tc)
> summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0152702 0.1889646 5.372805 7.752101e-08
`chile$sexM` -0.5742442 0.2022308 -2.839549 4.517738e-03
`chile$educationPS` -1.1074079 0.2914253 -3.799971 1.447128e-04
`chile$educationS` -0.6827546 0.2217459 -3.078996 2.076993e-03
`chile$statusquo` 3.1689305 0.1447911 21.886224 3.514468e-106
all parameters significant.
fitpred <- ifelse(fit$pred$pred == "Y", 1, 0) #to compare with chile.v
> confusionMatrix(fitpred,chile.v)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 445 429
1 422 407
Accuracy : 0.5003
95% CI : (0.4763, 0.5243)
No Information Rate : 0.5091
P-Value [Acc > NIR] : 0.7738
which is obviously very different from the previous confusion matrix. My expectation was that the cross validated results should not perform much worse then the first model. However the results show something else.
My assumption is that there is a mistake with the settings of the train() parameters but I can't figure it out what it is.
I would really appreciate some help, thank you in advance.