I would like to use cross-validation to train and test on my dataset, and to evaluate the performance of the logistic regression model on the entire dataset rather than only on the test set (e.g. 25%).
These concepts are totally new to me and I am not sure whether I am doing it right. I would be grateful if anyone could point out where I have gone wrong and advise me on the right steps to take. Part of my code is shown below.
Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?
Thank you
import pandas as pd
Data=pd.read_csv ('C:\\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]
Y=Data['Status']
Y1=Data['Status1'] # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted)
from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())
from nltk import ConfusionMatrix
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))
# sensitivity:
print (metrics.recall_score(y, predicted) )
import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
plt.hist(probs)
plt.show()
# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print (ConfusionMatrix(list(y), list(preds)))
# check accuracy, sensitivity, specificity
print (metrics.accuracy_score(y, predicted))
#ROC CURVES and AUC
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
# calculate AUC
print (metrics.roc_auc_score(y, probs))
# use AUC as evaluation metric for cross-validation
from sklearn.cross_validation import cross_val_score
logreg = LogisticRegression()
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
You got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset. You just need to remove logreg.fit earlier in the code. Specifically, what it does is the following: it divides your dataset into n folds, and in each iteration it leaves one of the folds out as the test set and trains the model on the rest of the folds (n-1 folds). So, in the end you will get predictions for the entire data.
Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y.
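The fold procedure described above can be written out by hand to make it concrete. This is only a sketch, assuming a recent scikit-learn where the splitters live in sklearn.model_selection; the names n_folds, predicted and times_tested are illustrative, and StratifiedKFold is used because that is what an integer cv resolves to for a classifier.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

n_folds = 10                                  # illustrative choice of n
predicted = np.empty_like(y)                  # one out-of-fold prediction per sample
times_tested = np.zeros(len(y), dtype=int)    # how often each sample was in a test fold

for train_idx, test_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
    # Train a fresh model on the other n-1 folds ...
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # ... and predict only the fold that was left out.
    predicted[test_idx] = model.predict(X[test_idx])
    times_tested[test_idx] += 1

# Every sample lands in exactly one test fold, so predicted covers the whole dataset.
print(times_tested.min(), times_tested.max())
print((predicted == y).mean())
```

No model ever predicts a sample it was trained on, which is why the resulting accuracy is an estimate for the whole dataset rather than for a single held-out split.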
To get predictions on the entire set with cross validation you can do the following:
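A minimal sketch with iris, assuming a recent scikit-learn in which cross_val_predict is imported from sklearn.model_selection (the sklearn.cross_validation module used in the question was removed in later releases). The max_iter value is an arbitrary choice to avoid convergence warnings.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X = iris["data"]
y = iris["target"]

# The estimator is passed in unfitted; cross_val_predict fits a clone per fold.
logreg = LogisticRegression(max_iter=1000)
predicted = cross_val_predict(logreg, X, y, cv=10)

print(predicted.shape)                       # one prediction per sample
print(metrics.accuracy_score(y, predicted))  # accuracy over the entire dataset
```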
So, back to your code. All you need is this:
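A sketch of what that could look like, under some assumptions: the synthetic data from make_classification is only a stand-in for Data[feature_cols] and Data['Status'] so that the snippet runs without the CSV file, and the imports assume a recent scikit-learn with sklearn.model_selection. Passing method="predict_proba" is one way to get out-of-fold probabilities for the histogram and ROC steps, since logreg itself is no longer fitted.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-in data (hypothetical); in the question this would be
# X = Data[feature_cols] and y = Data['Status'].
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=["A", "B", "C", "D", "E"])
y = pd.Series(y_arr, name="Status")

logreg = LogisticRegression(max_iter=1000)  # note: no logreg.fit(...) call

# Out-of-fold class predictions and positive-class probabilities.
predicted = cross_val_predict(logreg, X, y, cv=10)
probs = cross_val_predict(logreg, X, y, cv=10, method="predict_proba")[:, 1]

accuracy = metrics.accuracy_score(y, predicted)
auc = metrics.roc_auc_score(y, probs)
print(accuracy)
print(metrics.classification_report(y, predicted))
print(auc)
```

The train_test_split step is not needed for this kind of evaluation, because every sample is predicted by a model that did not see it during training.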
For plotting ROC in multi-class classification, you can follow this tutorial which gives you something like the following:
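As a rough sketch of the idea (not the tutorial's own code), the example below draws one one-vs-rest ROC curve per iris class on the same axes, using out-of-fold probabilities. It assumes a recent scikit-learn and matplotlib; the output filename roc_per_class.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show() instead
import matplotlib.pyplot as plt
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# Out-of-fold probabilities, one column per class.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=10, method="predict_proba"
)
# One binary indicator column per class (one-vs-rest).
y_bin = label_binarize(y, classes=[0, 1, 2])

fig, ax = plt.subplots()
aucs = {}
for i, name in enumerate(iris["target_names"]):
    fpr, tpr, _ = metrics.roc_curve(y_bin[:, i], probs[:, i])
    aucs[name] = metrics.auc(fpr, tpr)
    # Calling ax.plot repeatedly on the same axes overlays the curves.
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend(loc="lower right")
fig.savefig("roc_per_class.png")  # hypothetical output path
```

The same pattern, one roc_curve call and one ax.plot call per set of scores, should also work for overlaying the other prediction columns from the question on the existing curve.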
In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.