Classification accuracy after recall and precision

Posted 2019-03-03 20:50

I'm just wondering if this is a legitimate way of calculating classification accuracy:

  1. obtain the precision-recall thresholds
  2. for each threshold, binarize the continuous y_scores
  3. calculate the accuracy from the contingency table (confusion matrix)
  4. return the average accuracy over all thresholds

    import numpy as np
    from sklearn.metrics import confusion_matrix, precision_recall_curve
    from sklearn.preprocessing import binarize

    # precision_recall_curve returns (precision, recall, thresholds) in that order
    precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))
    accuracy = 0
    for threshold in thresholds:
        # binarize expects a 2D array, hence the reshape and the [0]
        binarized = binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0]
        contingency_table = confusion_matrix(np_y_true, binarized)
        accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1])) / float(np.sum(contingency_table))

    print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
    

1 Answer

Rolldiameter · 2019-03-03 21:36

You are heading in the right direction. The confusion matrix is definitely the right starting point for computing the accuracy of your classifier. It seems to me that you are aiming at the receiver operating characteristic (ROC).

In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. https://en.wikipedia.org/wiki/Receiver_operating_characteristic

The AUC (area under the curve) is a measure of your classifier's performance. More information and explanation can be found here:

https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it

http://mlwiki.org/index.php/ROC_Analysis
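
For reference, scikit-learn also ships ROC helpers (roc_curve and roc_auc_score in sklearn.metrics). A minimal sketch, with made-up labels and scores purely for illustration:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # illustrative data: binary 0/1 labels and continuous classifier scores
    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])

    # roc_curve returns the false positive rate, true positive rate and the thresholds used
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    # roc_auc_score computes the area under that curve directly from labels and scores
    print("AUC: {}".format(roc_auc_score(y_true, y_scores)))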

This is my implementation, which you are welcome to improve/comment:

    import numpy as np
    import matplotlib.pyplot as plt

    def auc(y_true, y_val, plot=False):
        #check input
        if len(y_true) != len(y_val):
            raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')
        #empty lists for the true positive and false positive counts
        tp = []
        fp = []
        #count 1's and -1's in y_true
        cond_positive = list(y_true).count(1)
        cond_negative = list(y_true).count(-1)
        #all possibly relevant bias parameters stored in a list
        bias_set = sorted(list(set(y_val)), key=float, reverse=True)
        #add one more bias below the smallest score so everything gets predicted positive once
        #(note: this assumes the smallest score is positive)
        bias_set.append(min(bias_set)*0.9)

        #initialize y_pred array full of negative predictions (-1)
        y_pred = np.ones(len(y_true))*(-1)

        #the computation time is mainly influenced by this for loop
        #for a contamination rate of 1% it already takes ~8s to terminate
        for bias in bias_set:
            #"lower values tend to correspond to label -1"
            #indices of values which exceed the bias
            posIdx = np.where(y_val > bias)
            #set predicted values to 1
            y_pred[posIdx] = 1
            #y_true + 2*y_pred distinguishes the cases:
            #true positive -> 3, false positive -> 1
            results = np.asarray(y_true) + 2*np.asarray(y_pred)
            #append the number of tp's and fp's
            tp.append(float(list(results).count(3)))
            fp.append(float(list(results).count(1)))

        #calculate true positive rate and false positive rate
        tpr = np.asarray(tp)/cond_positive
        fpr = np.asarray(fp)/cond_negative
        #optional scatter plot of the ROC points
        if plot:
            plt.scatter(fpr, tpr)
            plt.show()
        #calculate AUC with the trapezoidal rule
        AUC = np.trapz(tpr, fpr)

        return AUC
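
A quick usage sketch (the data here is made up for illustration; labels are expected in {1, -1}, and y_val should be a numpy array so that y_val > bias broadcasts element-wise):

    y_true = np.array([1, 1, -1, -1, 1])
    y_val = np.array([0.9, 0.7, 0.3, 0.2, 0.4])
    #prints 1.0 for this perfectly separable toy example
    print("AUC: {}".format(auc(y_true, y_val)))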