I am able to get a ROC curve using scikit-learn with `fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1)`, where `y_true` is a list of values based on my gold standard (i.e., `0` for negative and `1` for positive cases) and `y_pred` is a corresponding list of scores (e.g., `0.053497243`, `0.008521122`, `0.022781548`, `0.101885263`, `0.012913795`, `0.0`, `0.042881547` [...]).
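For reference, a minimal runnable version of that call could look like the sketch below; only the score values come from above, the `y_true` labels here are made up for illustration:

```python
from sklearn import metrics

# Illustrative gold-standard labels (made up) paired with the example scores.
y_true = [0, 0, 1, 1, 0, 0, 1]
y_pred = [0.053497243, 0.008521122, 0.022781548, 0.101885263,
          0.012913795, 0.0, 0.042881547]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1)
print(fpr, tpr, thresholds)
```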
I am trying to figure out how to add confidence intervals to that curve, but I haven't found an easy way to do that with sklearn.
DeLong Solution [NO bootstrapping]
As some here have suggested, the `pROC` approach would be nice. According to the `pROC` documentation, confidence intervals are calculated via DeLong.

Yandex Data School has a fast DeLong implementation on their public repo: https://github.com/yandexdataschool/roc_comparison

All credit for the DeLong implementation used in this example goes to them. Here is how you get a CI via DeLong:
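A sketch of how this can be wired up is shown below. It assumes the `compare_auc_delong_xu` module from the Yandex repo is importable and exposes `delong_roc_variance(y_true, y_pred)` returning the AUC and its variance (check the repo for the exact module and function names); the `y_true` / `y_pred` values are illustrative toy data:

```python
import numpy as np
import scipy.stats

# Assumes the module from https://github.com/yandexdataschool/roc_comparison
# is on the path; verify the exact module/function names in the repo.
from compare_auc_delong_xu import delong_roc_variance

alpha = 0.95
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])           # illustrative labels
y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92,
                   0.79, 0.82, 0.99, 0.04])               # illustrative scores

# DeLong estimate of the AUC and of its variance
auc, auc_var = delong_roc_variance(y_true, y_pred)
auc_std = np.sqrt(auc_var)

# Normal-approximation confidence interval around the AUC
lower_upper_q = np.abs(np.array([0, 1]) - (1 - alpha) / 2)
ci = scipy.stats.norm.ppf(lower_upper_q, loc=auc, scale=auc_std)
ci[ci > 1] = 1  # the AUC cannot exceed 1

print('AUC:', auc)
print('AUC variance:', auc_var)
print('95% AUC CI:', ci)
```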
I've also checked that this implementation matches the `pROC` results obtained from R.
You can bootstrap the ROC computations (sample with replacement new versions of `y_true` / `y_pred` out of the original `y_true` / `y_pred` and recompute a new value for `roc_curve` each time) and then estimate a confidence interval this way.

To take the variability induced by the train/test split into account, you can also use the ShuffleSplit CV iterator many times, fit a model on each train split, generate `y_pred` for each model, gather an empirical distribution of `roc_curve`s that way as well, and finally compute confidence intervals for those (see the cross-validation sketch at the end of this answer).

Edit: bootstrapping in Python
Here is an example for bootstrapping the ROC AUC score out of the predictions of a single model. I chose to bootstrap the ROC AUC to make it easier to follow as a Stack Overflow answer, but it can be adapted to bootstrap the whole curve instead:
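A minimal sketch of such a loop follows; the toy `y_true` / `y_pred` values are illustrative, chosen so that roughly 3 of the 9 predictions are misranked:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative toy data: 9 predictions, roughly 3 of which are misranked.
y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0])

print("Original ROC area: {:0.3f}".format(roc_auc_score(y_true, y_pred)))

n_bootstraps = 1000
rng_seed = 42  # control reproducibility
rng = np.random.RandomState(rng_seed)

bootstrapped_scores = []
for i in range(n_bootstraps):
    # Bootstrap by sampling with replacement on the prediction indices.
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject this resample.
        continue
    score = roc_auc_score(y_true[indices], y_pred[indices])
    bootstrapped_scores.append(score)
```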
You can see that we need to reject some invalid resamples. However, on real data with many predictions this is a very rare event and should not impact the confidence interval significantly (you can try varying `rng_seed` to check).

Here is the histogram:
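A minimal way to produce it with matplotlib, assuming `bootstrapped_scores` from the loop above:

```python
import matplotlib.pyplot as plt

# Histogram of the bootstrapped ROC AUC scores.
plt.hist(bootstrapped_scores, bins=50)
plt.title('Histogram of the bootstrapped ROC AUC scores')
plt.show()
```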
Note that the resampled scores are censored in the [0 - 1] range causing a high number of scores in the last bin.
To get a confidence interval one can sort the samples:
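For example, taking the 5th and 95th percentiles of the bootstrap distribution for a 90% interval (again assuming `bootstrapped_scores` from above):

```python
import numpy as np

sorted_scores = np.array(bootstrapped_scores)
sorted_scores.sort()

# 90% confidence interval from the empirical bootstrap distribution.
# Use 0.025 / 0.975 instead to get a 95% interval.
confidence_lower = sorted_scores[int(0.05 * len(sorted_scores))]
confidence_upper = sorted_scores[int(0.95 * len(sorted_scores))]
print("Confidence interval for the score: [{:0.3f} - {:0.3f}]".format(
    confidence_lower, confidence_upper))
```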
The resulting confidence interval is very wide, but this is probably a consequence of my choice of predictions (3 mistakes out of 9 predictions) and of the quite small total number of predictions.
Another remark on the plot: the scores are quantized (many empty histogram bins). This is a consequence of the small number of predictions. One could introduce a bit of Gaussian noise on the scores (or the `y_pred` values) to smooth the distribution and make the histogram look better, but then the choice of the smoothing bandwidth is tricky.

Finally, as stated earlier, this confidence interval is specific to your training set. To get a better estimate of the variability of the ROC induced by your model class and parameters, you should do iterated cross-validation instead. However, this is often much more costly, as you need to train a new model for each random train/test split.
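A sketch of that iterated cross-validation with ShuffleSplit is given below; the `LogisticRegression` model and the `make_classification` data are placeholders for your own estimator and `(X, y)` dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

# Placeholder dataset and model; substitute your own X, y and estimator.
X, y = make_classification(n_samples=500, random_state=0)
estimator = LogisticRegression(max_iter=1000)

cv = ShuffleSplit(n_splits=200, test_size=0.25, random_state=0)
cv_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Fit a fresh model on each random train split and score its test split,
    # so the distribution also reflects the train/test split variability.
    model = estimator.fit(X[train_idx], y[train_idx])
    y_score = model.predict_proba(X[test_idx])[:, 1]
    cv_scores.append(roc_auc_score(y[test_idx], y_score))

# Empirical 90% interval of the AUC across splits.
cv_scores = np.array(cv_scores)
print("AUC 90% interval: [{:0.3f} - {:0.3f}]".format(
    np.percentile(cv_scores, 5), np.percentile(cv_scores, 95)))
```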