The output of my neural network is a table of predicted class probabilities for multi-label classification:
print(probabilities)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|--------------|--------------|-----|--------------|--------------|--------------|
| 0 | 2.442745e-05 | 5.952136e-06 | ... | 4.254002e-06 | 1.894523e-05 | 1.033957e-05 |
| 1 | 7.685694e-05 | 3.252202e-06 | ... | 3.617730e-06 | 1.613792e-05 | 7.356643e-06 |
| 2 | 2.296657e-06 | 4.859554e-06 | ... | 9.934525e-06 | 9.244772e-06 | 1.377618e-05 |
| 3 | 5.163169e-04 | 1.044035e-04 | ... | 1.435158e-04 | 2.807420e-04 | 2.346930e-04 |
| 4 | 2.484626e-06 | 2.074290e-06 | ... | 9.958628e-06 | 6.002510e-06 | 8.434519e-06 |
| 5 | 1.297477e-03 | 2.211737e-04 | ... | 1.881772e-04 | 3.171079e-04 | 3.228884e-04 |
I converted it to class labels using a threshold (0.2) to measure the accuracy of my predictions:
predictions = (probabilities > 0.2).astype(int)
print(predictions)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 1 | 0 | 0 | ... | 0 | 0 | 0 |
| 2 | 0 | 0 | ... | 0 | 0 | 0 |
| 3 | 0 | 0 | ... | 0 | 0 | 0 |
| 4 | 0 | 0 | ... | 0 | 0 | 0 |
| 5 | 0 | 0 | ... | 0 | 0 | 0 |
I also have a test set with the true labels:
print(Y_test)
| | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 1 | 0 | 0 | ... | 0 | 0 | 0 |
| 2 | 0 | 0 | ... | 0 | 0 | 0 |
| 3 | 0 | 0 | ... | 0 | 0 | 0 |
| 4 | 0 | 0 | ... | 0 | 0 | 0 |
| 5 | 0 | 0 | ... | 0 | 0 | 0 |
Question: How can I build an algorithm in Python that chooses the optimal threshold, i.e. the one that maximizes roc_auc_score(average='micro') or another metric? Perhaps it is possible to write a manual function in Python that optimizes the threshold, depending on the chosen accuracy metric.
I assume your ground-truth labels are `Y_test` and your predictions are `predictions`. Optimizing `roc_auc_score(average='micro')` with respect to a prediction threshold does not seem to make sense, as AUCs are computed based on how predictions are ranked and therefore need `predictions` as float values in [0, 1]. Therefore, I will discuss `accuracy_score`. You could use `scipy.optimize.fmin`:
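A minimal sketch of that `fmin` idea, using synthetic stand-ins for the real `Y_test` and `probabilities` arrays (swap in your own data). One caveat: accuracy is a step function of the threshold, so the Nelder-Mead search behind `fmin` can stall on flat regions; a coarse grid search over candidate thresholds is a more robust fallback.

```python
import numpy as np
from scipy.optimize import fmin
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the real Y_test / probabilities arrays
rng = np.random.default_rng(0)
Y_test = rng.integers(0, 2, size=(100, 5))
probabilities = np.clip(Y_test * 0.6 + rng.random((100, 5)) * 0.5, 0.0, 1.0)

def neg_accuracy(threshold):
    """Negative (subset) accuracy at the given threshold; fmin minimizes."""
    predictions = (probabilities > threshold).astype(int)
    return -accuracy_score(Y_test, predictions)

# Start the search from the hand-picked 0.2 threshold
best_threshold = fmin(neg_accuracy, x0=0.2, disp=False)[0]
print(best_threshold, -neg_accuracy(best_threshold))
```

The same `neg_accuracy` objective works with any other metric from `sklearn.metrics` (e.g. `f1_score(average='micro')`) by swapping the scoring call.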
The best way to do so is to put a logistic regression on top of your new dataset: consider your network's output a new dataset and train an LR on it. The LR will multiply every probability by a certain constant and thus provide an automatic threshold on the output (with the LR you just need to predict the class, not the probabilities).
You need to train this by subdividing the test set in two, using one part to train the LR after predicting the output with the NN. This is not the only way to do it, but it works fine for me every time.
So we have X_train_nn, X_valid_nn, X_test_NN, and we subdivide X_test_NN into X_train_LR and X_test_LR (or do a stratified K-fold as you wish). Here is a sample of the code:
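A sketch of what that could look like; the data is synthetic, the `X_train_LR` / `X_test_LR` names follow the answer's naming, and `MultiOutputClassifier` is one way (an assumption, not the answerer's original code) to fit one LR per label in the multi-label case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins: NN output probabilities and the matching multi-label targets
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(400, 3))
probs = np.clip(Y * 0.5 + rng.random((400, 3)) * 0.6, 0.0, 1.0)

# Split the NN's test-set predictions: one half trains the LR, the other evaluates it
X_train_LR, X_test_LR, y_train_LR, y_test_LR = train_test_split(
    probs, Y, test_size=0.5, random_state=0)

# One logistic regression per label: each learns a scale/bias on the probabilities,
# which amounts to an automatic per-label threshold
lr = MultiOutputClassifier(LogisticRegression())
lr.fit(X_train_LR, y_train_LR)
labels = lr.predict(X_test_LR)   # hard 0/1 labels, no manual threshold needed
print(accuracy_score(y_test_LR, labels))
```

Because `predict` already returns hard labels, no threshold ever needs to be chosen by hand; the LR's learned bias plays that role.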