Classification accuracy after recall and precision

Posted 2019-03-03 20:50

I'm just wondering if this is a legitimate way of calculating classification accuracy:

  1. obtain the precision-recall thresholds
  2. for each threshold, binarize the continuous y_scores
  3. calculate the accuracy from the contingency table (confusion matrix)
  4. return the average accuracy over all thresholds

    import numpy as np
    from sklearn.metrics import confusion_matrix, precision_recall_curve
    from sklearn.preprocessing import binarize

    # precision_recall_curve returns (precision, recall, thresholds) in that order
    precision, recall, thresholds = precision_recall_curve(np.array(np_y_true), np.array(np_y_scores))
    accuracy = 0
    for threshold in thresholds:
        # binarize expects a 2D array, hence the reshape and the [0]
        binarized = binarize(np.array(np_y_scores).reshape(1, -1), threshold=threshold)[0]
        contingency_table = confusion_matrix(np_y_true, binarized)
        accuracy += (float(contingency_table[0][0]) + float(contingency_table[1][1])) / float(np.sum(contingency_table))

    print("Classification accuracy is: {}".format(accuracy / len(thresholds)))
    

1 Answer

Rolldiameter · 2019-03-03 21:36

You are heading in the right direction. The confusion matrix is definitely the right starting point for computing the accuracy of your classifier. It seems to me that you are aiming at the receiver operating characteristic (ROC).

In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. https://en.wikipedia.org/wiki/Receiver_operating_characteristic

The AUC (area under the curve) is a measure of your classifier's performance. More information and explanation can be found here:

https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it

http://mlwiki.org/index.php/ROC_Analysis
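
For reference, scikit-learn also ships ROC helpers (roc_curve and roc_auc_score in sklearn.metrics). A minimal sketch, with made-up labels and scores purely for illustration:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # illustrative data: binary 0/1 labels and continuous classifier scores
    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])

    # roc_curve returns the false positive rate, true positive rate and the thresholds used
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    # roc_auc_score computes the area under that curve directly from labels and scores
    print("AUC: {}".format(roc_auc_score(y_true, y_scores)))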

This is my implementation, which you are welcome to improve/comment:

    import numpy as np
    import matplotlib.pyplot as plt

    def auc(y_true, y_val, plot=False):
        #check input
        if len(y_true) != len(y_val):
            raise ValueError('Label vector (y_true) and corresponding value vector (y_val) must have the same length.\n')
        #empty lists for the true positive and false positive counts
        tp = []
        fp = []
        #count 1's and -1's in y_true
        cond_positive = list(y_true).count(1)
        cond_negative = list(y_true).count(-1)
        #all possibly relevant bias parameters stored in a list
        bias_set = sorted(list(set(y_val)), key=float, reverse=True)
        #add one more bias below the smallest score so everything gets predicted positive once
        #(note: this assumes the smallest score is positive)
        bias_set.append(min(bias_set)*0.9)

        #initialize y_pred array full of negative predictions (-1)
        y_pred = np.ones(len(y_true))*(-1)

        #the computation time is mainly influenced by this for loop
        #for a contamination rate of 1% it already takes ~8s to terminate
        for bias in bias_set:
            #"lower values tend to correspond to label -1"
            #indices of values which exceed the bias
            posIdx = np.where(y_val > bias)
            #set predicted values to 1
            y_pred[posIdx] = 1
            #y_true + 2*y_pred distinguishes the cases:
            #true positive -> 3, false positive -> 1
            results = np.asarray(y_true) + 2*np.asarray(y_pred)
            #append the number of tp's and fp's
            tp.append(float(list(results).count(3)))
            fp.append(float(list(results).count(1)))

        #calculate true positive rate and false positive rate
        tpr = np.asarray(tp)/cond_positive
        fpr = np.asarray(fp)/cond_negative
        #optional scatter plot of the ROC points
        if plot:
            plt.scatter(fpr, tpr)
            plt.show()
        #calculate AUC with the trapezoidal rule
        AUC = np.trapz(tpr, fpr)

        return AUC
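
A quick usage sketch (the data here is made up for illustration; labels are expected in {1, -1}, and y_val should be a numpy array so that y_val > bias broadcasts element-wise):

    y_true = np.array([1, 1, -1, -1, 1])
    y_val = np.array([0.9, 0.7, 0.3, 0.2, 0.4])
    #prints 1.0 for this perfectly separable toy example
    print("AUC: {}".format(auc(y_true, y_val)))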