As part of the Enron project, I built the attached model. Below is a summary of the steps.
The model below gives near-perfect scores:
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # fit on the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # predict with the already-fitted best estimator, without refitting on x_train
    gcv.best_estimator_.predict(x_test)
The model below gives more reasonable but lower scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # fit on the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # refit the best estimator on the training fold only, then predict on the test fold
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
Used SelectKBest to get the feature scores, sorted the features, and tried combinations of higher- and lower-scoring features.
Used an SVM with GridSearchCV, using a StratifiedShuffleSplit for the cross-validation.
Used the best_estimator_ to predict and calculate precision and recall (a sketch of this setup follows below).
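Roughly, the setup for the steps above looks like the sketch below. This is only an illustration, not the exact code in the attached model: the pipeline step names, the parameter-grid values, and the assumption that features and labels are the already-loaded Enron arrays (NumPy arrays) are placeholders of mine.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import precision_score, recall_score

# step 1: inspect and sort the univariate feature scores
kbest_all = SelectKBest(f_classif, k='all').fit(features, labels)
ranked = np.argsort(kbest_all.scores_)[::-1]   # feature indices, best first

# step 2: SVM inside a pipeline, tuned with GridSearchCV over a StratifiedShuffleSplit
pipe = Pipeline([
    ('kbest', SelectKBest(f_classif)),
    ('svm', SVC()),
])
clf_params = {
    'kbest__k': [5, 10],                       # placeholder values
    'svm__C': [1, 10, 100],
    'svm__gamma': [0.001, 0.01],
}
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)

# step 3: precision/recall of the chosen estimator on one CV split
# (this mirrors the first snippet above: no refit on the training fold)
train_ind, test_ind = next(cv.split(features, labels))
y_pred = gcv.best_estimator_.predict(features[test_ind])
print(precision_score(labels[test_ind], y_pred))
print(recall_score(labels[test_ind], y_pred))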
The problem is that the estimator is spitting out perfect scores, in some cases 1.0.
But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.
My question is: what exactly does GridSearchCV do with the test data after the split produced by the ShuffleSplit object we pass in? I assumed it would not fit anything on the test data; if that is true, then when I predict on that same test data, it should not give such high scores, right? Since I used a fixed random_state, the ShuffleSplit should have produced the same splits for the grid-search fit and for the prediction.
So, is using the same ShuffleSplit for both steps wrong?
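For reference, the random_state assumption can be checked in isolation. This is a minimal sketch on a tiny synthetic array (not the Enron data), just to show that a StratifiedShuffleSplit with a fixed random_state yields the same folds every time split() is called:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)        # tiny synthetic data, not the Enron set
y = np.array([0, 1] * 5)

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

first = [test for _, test in cv.split(X, y)]
second = [test for _, test in cv.split(X, y)]

# with a fixed random_state, repeated calls to split() produce identical folds
print(all(np.array_equal(a, b) for a, b in zip(first, second)))  # True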