Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years.
Basic issue is that I've done data cleaning (imputing, centering, scaling); split that processed/transformed data into training (70%) and testing (30%) sets; balanced the data using two approaches (because data was 65%=0, 35%=1--and I've found inconsistent advice on what classifies as unbalanced, but one source suggested anything not within 40/60 range)--ROSE "BOTH" and SMOTE; and ran random forests.
For the ROSE "BOTH" models I got 0.9242 accuracy on the training set and AUC of 0.9268 for the test set.
For the SMOTE model I got 0.9943 accuracy on the training set and AUC of 0.9971 on the test set.
More details on model performance are embedded in the code copied below.
This just seems too good to be true. But, from what I've been able to find slightly improved performance on the test set would not indicate overfitting (it'd be the other way around). So, is this models performance likely really good or is it too good to be true? I have not been able to find a direct answer to this question via SO searches.
Also, in a few weeks I'll have another cohort of data I can run this on. I suppose that could be another "test" set, correct? Then I can apply this to the newest cohort for which we are interested in knowing likelihood to graduate in 4 years.
Many thanks, Brian
#Used for predictive modeling of 4-year graduation
#IMPORT DATA
library(haven)
grad4yr <- [file path]
#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517
#convert to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)
#Preprocess data, RANN package used
library('RANN')
#Create proprocessed values object which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))
#Create new dataset with imputed values and centering/scaling
#Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)
#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0
#Convert outcome variable to numeric to ensure dummify step (next) doesn't dummify outcome variable.
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)
#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)
#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))
#Convert outcome variable back to binary/factor for predictive models and create back variable with same name
#not entirely sure who last step created new version of outcome var with ".1" at the end
grad4yr_processed_transformed$graduate_4_yrs.1 <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL
#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]
#load caret
library(caret)
#Feature selection using rfe in R Caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 10,#using k=10 per Kuhn & Johnson pp70; and per James et al pp
#https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
verbose = FALSE)
#create traincontrol using repeated cross-validation with 10 fold 5 times
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random")
#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'
#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])
#create predictor profile to see what where prediction is best (by num vars)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables Accuracy Kappa AccuracySD KappaSD Selected
# 4 0.6877 0.2875 0.03605 0.08618
# 8 0.7057 0.3078 0.03461 0.08465 *
# 16 0.7006 0.2993 0.03286 0.08036
# 40 0.6949 0.2710 0.03330 0.08157
#
# The top 5 variables (out of 8):
# Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10
#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric
#given 65/35 split on outcome var and what can find about unbalanced data, considering unbalanced and doing steps to balance.
#using ROSE "BOTH and SMOTE to see how differently they perform. Also ran under/over with ROSE but they didn't perform nearly as
#well so removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)
#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0 1
#0.4615385 0.5384615
#open ROSE package/library
library("ROSE")
#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5,
N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0 1
#0.4987212 0.5012788
#ROSE to balance data (using BOTH) on the processed/dummified dataset
table(grad4yr_processed_transformed$graduate_4_yrs)
#0 1
#1144 618
library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest
#
# 2346 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 8 0.9055406 0.8110631
# 11 0.9053719 0.8107246
# 12 0.9057981 0.8115770
# 13 0.9054584 0.8108965
# 14 0.9048602 0.8097018
# 20 0.9034992 0.8069796
# 26 0.9027307 0.8054427
# 30 0.9034152 0.8068113
# 38 0.9023899 0.8047622
# 40 0.9032428 0.8064672
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.
RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest
#
# 8034 samples
# 40 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ...
# Resampling results across tuning parameters:
#
# mtry Accuracy Kappa
# 17 0.9449082 0.8899939
# 19 0.9458047 0.8917740
# 21 0.9458543 0.8918695
# 29 0.9470243 0.8941794
# 31 0.9468750 0.8938864
# 35 0.9468003 0.8937290
# 36 0.9463772 0.8928876
# 40 0.9463275 0.8927828
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.
#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.
#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")
#Create predictions based on random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")
#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 315 12
# 1 28 173
#
# Accuracy : 0.9242
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : < 2e-16
#
# Kappa : 0.8368
# Mcnemar's Test P-Value : 0.01771
#
# Sensitivity : 0.9351
# Specificity : 0.9184
# Pos Pred Value : 0.8607
# Neg Pred Value : 0.9633
# Prevalence : 0.3504
# Detection Rate : 0.3277
# Detection Prevalence : 0.3807
# Balanced Accuracy : 0.9268
#
# 'Positive' Class : 1
# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
#only copied accuracy as it was fair below two other versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
#
# Reference
# Prediction 0 1
# 0 340 0
# 1 3 185
#
# Accuracy : 0.9943
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496
# P-Value [Acc > NIR] : <2e-16
#
# Kappa : 0.9876
# Mcnemar's Test P-Value : 0.2482
#
# Sensitivity : 1.0000
# Specificity : 0.9913
# Pos Pred Value : 0.9840
# Neg Pred Value : 1.0000
# Prevalence : 0.3504
# Detection Rate : 0.3504
# Detection Prevalence : 0.3561
# Balanced Accuracy : 0.9956
#
# 'Positive' Class : 1
#put predictions in dataset
testSet$rf_both_pred <- rf_both_predictions#predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions#probabilities (BOTH)
testSet$rf_both_prob <- rf_both_pred_prob#predictions (SMOTE)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob#probabilities (SMOTE)
library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_both_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268
#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
predictor = testSet$rf_SMOTE_pred,
levels = rev(levels(testSet$graduate_4_yrs)))
auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971
#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true.
#Is there anything I might have missed or performed incorrectly?
I'll post as an answer my comment, even if this might be migrated.
I really think that you're overfitting, because you have balanced on the whole dataset. Instead you should balance only the train set.
Here is your code:
By doing so your
train.SMOTE
now contains information from the test set too, so when you'll test on yourtestSet
the model will have already seen part of the data, and this will likely be the cause of your "too good" results.It should be: