I am new to machine learning and to scikit-learn.
My problem:
(Please correct any misconceptions.)
I have a dataset which is a big JSON file; I retrieve it and store it in a trainList variable.
I pre-process it in order to be able to work with it.
Once I have done that, I start the classification:
- I use the k-fold cross-validation method in order to obtain the mean accuracy, and I train a classifier.
- I make the predictions and obtain the accuracy and confusion matrix of that fold.
- After this, I would like to obtain the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values. I would use these parameters to obtain the sensitivity and the specificity, and I would send them and the total number of TPs to an HTML page in order to show a chart with the TPs of each label.
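(For reference, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).)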
Code:
The variables I have for the moment:
trainList #It is a list with all the data of my dataset in JSON form
labelList #It is a list with all the labels of my data
Most of the method:
#Imports used below (vec and qda are created earlier in my code;
#vec is my vectorizer and qda is my QDA classifier)
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix

#I transform the data from JSON form to a numerical one
X = vec.fit_transform(trainList)

#I scale the matrix (don't know why, but without it I get an error)
X = preprocessing.scale(X.toarray())

#I generate a KFold in order to make cross validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

#I start the cross validation
for train_indices, test_indices in kf.split(X):
    X_train = [X[ii] for ii in train_indices]
    X_test = [X[ii] for ii in test_indices]
    y_train = [labelList[ii] for ii in train_indices]
    y_test = [labelList[ii] for ii in test_indices]

    #I train the classifier
    trained = qda.fit(X_train, y_train)

    #I make the predictions
    predicted = qda.predict(X_test)

    #I obtain the accuracy of this fold
    ac = accuracy_score(y_test, predicted)

    #I obtain the confusion matrix
    cm = confusion_matrix(y_test, predicted)

    #I should calculate the TP, TN, FP and FN
    #I don't know how to continue
You can try sklearn.metrics.classification_report, as below.
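A minimal sketch with made-up labels (the exact layout of the printed report depends on your scikit-learn version):

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

Output (per-class rows shown; newer versions also print accuracy and averaged rows):

             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3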
I wrote a version that works using only numpy. I hope it helps you.
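A minimal sketch, assuming binary labels encoded as 0/1 (the function name perf_measure is just an example):

import numpy as np

def perf_measure(y_actual, y_pred):
    #Count each of the four outcomes with boolean masks
    y_actual = np.asarray(y_actual)
    y_pred = np.asarray(y_pred)
    TP = np.sum((y_actual == 1) & (y_pred == 1))
    TN = np.sum((y_actual == 0) & (y_pred == 0))
    FP = np.sum((y_actual == 0) & (y_pred == 1))
    FN = np.sum((y_actual == 1) & (y_pred == 0))
    return TP, FP, TN, FN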
The one-liner to get true positives etc. out of the confusion matrix is to ravel it.
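A minimal sketch for binary labels (y_test and predicted are the variables from the question; scikit-learn returns the raveled values in the order TN, FP, FN, TP):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, predicted).ravel()
print(tn, fp, fn, tp)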
You can obtain all of the parameters from the confusion matrix. In scikit-learn the rows of the confusion matrix are the actual classes and the columns are the predicted classes, so for a binary problem it is a 2x2 matrix.
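A sketch of its layout and how to index it (assuming labels 0 and 1, with cm being the matrix returned by confusion_matrix(y_test, predicted) in the question):

#              predicted 0   predicted 1
# actual 0         TN            FP
# actual 1         FN            TP

TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]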
More details at https://en.wikipedia.org/wiki/Confusion_matrix
I don't think either of those answers is fully correct. For example, suppose that we have the following arrays:
y_actual = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
y_predic = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
If we compute the FP, FN, TP and TN values manually, they should be as follows:
FP: 3 FN: 1 TP: 3 TN: 4
However, if we use the first answer, the results are given as follows:
FP: 1 FN: 3 TP: 3 TN: 4
These are not correct, because in the first answer a False Positive should be counted where the actual value is 0 but the predicted value is 1, not the opposite. The same applies to False Negative.
And if we use the second answer, the results are computed as follows:
FP: 3 FN: 1 TP: 4 TN: 3
The True Positive and True Negative numbers are not correct; they should be swapped.
Am I correct with my computations? Please let me know if I am missing something.
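To double-check the manual counts, here is a quick sketch with numpy (variable names match the arrays above):

import numpy as np

y_actual = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0])
y_predic = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((y_actual == 1) & (y_predic == 1))
TN = np.sum((y_actual == 0) & (y_predic == 0))
FP = np.sum((y_actual == 0) & (y_predic == 1))
FN = np.sum((y_actual == 1) & (y_predic == 0))

print(FP, FN, TP, TN)   # 3 1 3 4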
In scikit-learn's metrics module there is a confusion_matrix function which gives you the desired output.
You can use any classifier that you want; here I use KNeighborsClassifier as an example.
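A minimal sketch (X_train, X_test, y_train and y_test are placeholders for your own split; n_neighbors=3 is just an example):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)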
The docs: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix