Question:

I know the AUC/ROC area (http://weka.wikispaces.com/Area+under+the+curve) in Weka is based on the Mann-Whitney U statistic (http://en.wikipedia.org/wiki/Mann-Whitney_U).

My question is this: suppose I have 10 labeled instances (Y or N, a binary target attribute). Applying an algorithm (e.g. J48) to the dataset gives 10 predicted labels for these 10 instances. What exactly should I use to calculate AUC_Y, AUC_N, and AUC_Avg? The ranked predicted labels (Y and N), or the actual labels (Y' and N')? Or do I need to calculate the TP rate and FP rate?

Can anyone give me a small example and point me to the data I should use to calculate the AUC with the Mann-Whitney approach? Thanks in advance.

Sample data:

inst#    actual predicted  error   PrY     PrN
1        1:y        1:y          *0.973   0.027
2        1:y        1:y          *0.999   0.001
3        2:n        1:y      +   *0.568   0.432
4        2:n        2:n           0.382  *0.618
5        1:y        2:n      +    0.421  *0.579
6        2:n        2:n           0.146  *0.854
7        1:y        1:y          *1       0    
8        1:y        1:y          *0.999   0.001
9        2:n        2:n           0.11   *0.89 
10       1:y        2:n      +    0.377  *0.623

Answer 1:

Calculating the AUC is based on ranking your results. I've just read up on the Mann-Whitney-U statistic and I think it is basically how I do it in my code all the time.

First, you need something to rank your results. Usually, this is the decision value of your classifier (e.g. distance to the hyperplane for SVMs), but WEKA mostly uses the class probability. In your example, PrY and PrN sum up to 1, which is good, so you can pick either one, say PrY.

You then rank your instances by PrY in descending order (which is the same as ranking by PrN in ascending order):

inst#    actual predicted  error   PrY     PrN
7        1:y        1:y          *1       0    
8        1:y        1:y          *0.999   0.001
2        1:y        1:y          *0.999   0.001
1        1:y        1:y          *0.973   0.027
3        2:n        1:y      +   *0.568   0.432
5        1:y        2:n      +    0.421  *0.579
4        2:n        2:n           0.382  *0.618
10       1:y        2:n      +    0.377  *0.623
6        2:n        2:n           0.146  *0.854
9        2:n        2:n           0.11   *0.89
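
The ranking step can be reproduced with a few lines of R. This is only a minimal sketch using the values from the question; the variable names (inst, actual, pr_y, ranked) are made up for illustration and are not Weka output. Instances 2 and 8 are tied at 0.999, so their relative order may differ from the table above, which does not matter here because both are actual positives.

```r
# Data from the question: actual class and predicted probability of class y (PrY)
inst   <- 1:10
actual <- c("y", "y", "n", "n", "y", "n", "y", "y", "n", "y")
pr_y   <- c(0.973, 0.999, 0.568, 0.382, 0.421, 0.146, 1, 0.999, 0.11, 0.377)

# Sorting by PrY descending equals sorting by PrN ascending, since PrN = 1 - PrY
ord    <- order(pr_y, decreasing = TRUE)
ranked <- data.frame(inst = inst[ord], actual = actual[ord], PrY = pr_y[ord])
print(ranked)
```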

Following Wikipedia's description of the Mann-Whitney U statistic, you now count, for each instance, how many instances of the other class are ranked ahead of it (how often it is "beaten"), and sum these counts per actual class. For the positive instances (y), this would be

0, 0, 0, 0, 1, 2 => Sum: 3

and for the negative instances (n)

4, 5, 6, 6 => Sum: 21

So U_y = 3 and U_n = 21, checking it:

U_y + U_n = 24 = 6 * 4 = #y * #n

The AUC values are then (following Wikipedia):

AUC_y = U_y / (#y * #n) = 3 / 24 = 0.125
AUC_n = U_n / (#y * #n) = 21 / 24 = 0.875

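Now, in this case I strongly believe that AUC_n = 0.875 is the AUC you want. We sorted by PrN in ascending order, so the positives should come first, and U_n counts exactly the (y, n) pairs in which the positive instance is ranked ahead of the negative one.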
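
The counting above can be checked in R. This is a sketch under my own assumptions: the variable names are illustrative, ties between a positive and a negative are counted as half a pair (the usual convention, not needed for this data), and the last lines cross-check the result with the rank-sum form of the statistic.

```r
# Data from the question: actual class and predicted probability of class y (PrY)
actual <- c("y", "y", "n", "n", "y", "n", "y", "y", "n", "y")
pr_y   <- c(0.973, 0.999, 0.568, 0.382, 0.421, 0.146, 1, 0.999, 0.11, 0.377)

pos <- pr_y[actual == "y"]
neg <- pr_y[actual == "n"]

# Compare every positive score with every negative score
wins <- outer(pos, neg, ">")    # positive ranked ahead of negative
ties <- outer(pos, neg, "==")   # assumption: a tie counts as half a pair

u_n <- sum(wins) + 0.5 * sum(ties)       # U_n in the notation above
u_y <- length(pos) * length(neg) - u_n   # U_y, since U_y + U_n = #y * #n
auc <- u_n / (length(pos) * length(neg))

# Cross-check: rank sum of the positives minus its minimum possible value
r      <- rank(pr_y)                     # tied values get average ranks
u_rank <- sum(r[actual == "y"]) - length(pos) * (length(pos) + 1) / 2

print(c(U_y = u_y, U_n = u_n, AUC = auc, U_rank = u_rank))
```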

A more intuitive and graphical description of what we just did is this:

We sort our instances by their decision value / class probability. If we sort ascending by PrN, the positive ones should come first. (If we sort ascending by PrY, the negative ones should come first.) Now we draw a plot, beginning at coordinates (0,0). Every time we encounter an actual positive instance, we draw one unit up. Every time we encounter a negative instance, we draw one unit right. This line now separates two areas, which look like this in ASCII art (I'll replace it with a decent image as soon as I can):

|..##|
|.###|
|####|
|####|
|####|
|####|

The separating line is the ROC curve, and the area under it (hence the name) is the AUC. The AUC here is 21 units, which we need to normalize by dividing it by the total area of 24, yielding 21/24 = 0.875.
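
The same walk can be written as a short loop in R. Again this is just a sketch: labels holds the actual classes in the ranked order of the table above, and height, area and total are names I chose for illustration. Each step to the right adds one column of unit squares whose height is the number of positives seen so far.

```r
# Actual classes in ranked order (PrN ascending), taken from the sorted table
labels <- c("y", "y", "y", "y", "n", "y", "n", "y", "n", "n")

height <- 0   # units drawn upwards so far (positives seen)
area   <- 0   # unit squares under the line so far

for (lab in labels) {
  if (lab == "y") {
    height <- height + 1      # positive instance: one unit up
  } else {
    area <- area + height     # negative instance: one unit right
  }
}

total <- sum(labels == "y") * sum(labels == "n")   # 6 * 4 grid
auc   <- area / total
print(c(area = area, total = total, AUC = auc))
```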

You can also do the whole calculation already normalized, which is equivalent to plotting it as FPR vs TPR.

Answer 2:

Late to the party, but here is some R code I wrote to calculate the AUC from your data and plot the ROC curve. I used your actual and PrY fields in this case. Hope this helps you see how the calculations can be done.

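# Actual classes (1 = y, 2 = n) and PrY values, listed in descending PrY order
true_Y = c(1,1,1,1,2,1,2,1,2,2)
probs = c(1,0.999,0.999,0.973,0.568,0.421,0.382,0.377,0.146,0.11)

getROC_AUC = function(probs, true_Y){
    # Rank instances from highest to lowest probability of the positive class
    idx = order(probs, decreasing = TRUE)
    roc_y = true_Y[idx]

    # Cumulative FPR (x) and TPR (y); the leading 0 makes the curve start at (0, 0)
    stack_x = c(0, cumsum(roc_y == 2)/sum(roc_y == 2))
    stack_y = c(0, cumsum(roc_y == 1)/sum(roc_y == 1))

    # Sum of rectangles: width of each step in x times the height reached.
    # diff() avoids the precedence trap in 1:length(x)-1, which means (1:n)-1.
    auc = sum(diff(stack_x) * stack_y[-1])
    return(list(stack_x=stack_x, stack_y=stack_y, auc=auc))
}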

aList = getROC_AUC(probs, true_Y) 

stack_x = unlist(aList$stack_x)
stack_y = unlist(aList$stack_y)
auc = unlist(aList$auc)

plot(stack_x, stack_y, type = "l", col = "blue", xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC")
axis(1, seq(0.0,1.0,0.1))
axis(2, seq(0.0,1.0,0.1))
abline(h=seq(0.0,1.0,0.1), v=seq(0.0,1.0,0.1), col="gray", lty=3)
legend(0.7, 0.3, sprintf("%3.3f",auc), lty=c(1,1), lwd=c(2.5,2.5), col="blue", title = "AUC")