Understanding ROC curve

Published 2019-08-23 05:00

Question:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score
import numpy as np

correct_classification = np.array([0,1])
predicted_classification = np.array([1,1])

false_positive_rate, true_positive_rate, thresholds = roc_curve(correct_classification, predicted_classification)

print(false_positive_rate)
print(true_positive_rate)

From https://en.wikipedia.org/wiki/Sensitivity_and_specificity :

True positive: Sick people correctly identified as sick 
False positive: Healthy people incorrectly identified as sick 
True negative: Healthy people correctly identified as healthy 
False negative: Sick people incorrectly identified as healthy

I'm using these values 0 : sick, 1 : healthy

From https://en.wikipedia.org/wiki/False_positive_rate :

false positive rate = false positives / (false positives + true negatives)

number of false positives: 0, number of true negatives: 1

therefore false positive rate = 0 / (0 + 1) = 0
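This manual count can be checked with `confusion_matrix`. As a sketch, assuming 0 (sick) is the positive class: passing `labels=[1, 0]` puts the negative class first, so `ravel()` returns the counts in `tn, fp, fn, tp` order with respect to that convention.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1])  # 0 = sick, 1 = healthy
y_pred = np.array([1, 1])  # both predicted healthy

# labels=[1, 0] lists the negative class (healthy) first, so ravel()
# yields tn, fp, fn, tp with sick (0) treated as the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(fp / (fp + tn))  # false positive rate: 0.0
```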

Reading the return value for roc_curve (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve) :

fpr : array, shape = [>2]

Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].

tpr : array, shape = [>2]

Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].

thresholds : array, shape = [n_thresholds]

Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

How does this differ from my manual calculation of the false positive rate? How are the thresholds set? Some more information on thresholds is provided here: https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy, but I'm confused as to how it fits with this implementation.

Answer 1:

First, the Wikipedia article considers sick = 1:

True positive: Sick people correctly identified as sick

Second, every model classifies using some threshold on the predicted probability of the positive class (generally 0.5).

So if the threshold is 0.1, all samples with a probability at or above 0.1 will be classified as positive. The predicted probabilities stay fixed; only the threshold is varied.
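A minimal sketch of this, using made-up probabilities:

```python
import numpy as np

proba = np.array([0.05, 0.4, 0.6, 0.9])  # hypothetical positive-class probabilities

# The probabilities stay fixed; only the threshold moves
for threshold in (0.1, 0.5, 0.8):
    predictions = (proba >= threshold).astype(int)
    print(threshold, predictions)
# 0.1 -> [0 1 1 1]
# 0.5 -> [0 0 1 1]
# 0.8 -> [0 0 0 1]
```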

In roc_curve, scikit-learn sweeps the threshold from:

 the minimum score (where all predictions are positive) 

to

 above the maximum score (where all predictions become negative).

(The returned thresholds array lists these same values in decreasing order, as the documentation quoted above says.) Intermediate points are placed where predictions change from positive to negative, i.e. at the distinct predicted scores.

Example:

Sample 1      0.2
Sample 2      0.3
Sample 3      0.6
Sample 4      0.7
Sample 5      0.8

The lowest probability here is 0.2, so the lowest threshold that makes any difference is 0.2. As we keep increasing the threshold, the predictions change at each distinct probability, and each threshold equals one of the probabilities, because that is exactly where the counts of positives and negatives change. With so few points in this example, every probability becomes a threshold:

                      Negative    Positive
Threshold1   >=0.2       0          5
Threshold2   >=0.3       1          4
Threshold3   >=0.6       2          3
Threshold4   >=0.7       3          2
Threshold5   >=0.8       4          1
             > 0.8       5          0
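As a sketch, these counts can be generated by applying scikit-learn's >= convention at each candidate threshold; note that with >=, the threshold 0.2 itself still leaves all five samples predicted positive, and only a threshold above the maximum score (the max(y_score) + 1 entry from the docs) makes everything negative.

```python
import numpy as np

scores = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

# One candidate threshold per distinct score, plus one above the maximum
# so that every sample ends up predicted negative
for t in list(np.unique(scores)) + [scores.max() + 1]:
    positive = int((scores >= t).sum())
    negative = len(scores) - positive
    print(f"threshold {t:.1f}: negative={negative}, positive={positive}")
```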


Answer 2:

In the demo above, the threshold is the orange bar. The distribution of class 0 (the classifier's output probabilities) is in red, and the distribution of class 1 is in blue. The ROC curve works with the probabilities of being in one class or the other: if a sample has an output of [0.34, 0.66], then a threshold of 0.25 for class 1 will put it in class 1, since any sample whose class-1 probability is at least the threshold is classified as 1, regardless of which probability is higher.

You don't build the ROC curve from predicted classes but from the probabilities of being in a class.
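A minimal end-to-end sketch of this (synthetic data and a hypothetical model choice, not from the original question): passing `predict_proba` scores to `roc_curve` yields many thresholds and a proper curve, instead of the degenerate result you get from hard 0/1 labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of class 1, not hard predictions

# Feeding probabilities gives one threshold per change point on the curve
fpr, tpr, thresholds = roc_curve(y, proba)
print(len(thresholds), roc_auc_score(y, proba))
```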

I hope this answers the question (sorry if not, I'll be more precise if needed).