I know the AUC/ROC area (http://weka.wikispaces.com/Area+under+the+curve) in WEKA is based on the Mann-Whitney statistic (http://en.wikipedia.org/wiki/Mann-Whitney_U).
But my question is this: if I have 10 labeled instances (Y or N, a binary target attribute) and apply an algorithm (e.g. J48) to the dataset, I get 10 predicted labels for these 10 instances. What exactly should I use to calculate AUC_Y, AUC_N, and AUC_Avg? The ranked predicted labels Y and N, or the actual labels (Y' and N')? Or do I need to calculate the TP rate and FP rate?
Can anyone give me a small example and point out which data I should use to calculate the AUC with the Mann-Whitney approach? Thanks in advance.
Sample data:
inst#  actual  predicted  error  PrY     PrN
1      1:y     1:y               *0.973  0.027
2      1:y     1:y               *0.999  0.001
3      2:n     1:y        +      *0.568  0.432
4      2:n     2:n               0.382   *0.618
5      1:y     2:n        +      0.421   *0.579
6      2:n     2:n               0.146   *0.854
7      1:y     1:y               *1      0
8      1:y     1:y               *0.999  0.001
9      2:n     2:n               0.11    *0.89
10     1:y     2:n        +      0.377   *0.623
Late to the party, but here is some R code I wrote to calculate the AUC from your data and plot the ROC curve. I used your actual and PrY fields in this case. Hope this helps you see how the calculations can be done.
Calculating the AUC is based on ranking your results. I've just read up on the Mann-Whitney U statistic, and I think it is basically what I always do in my code.
First, you need something to rank your results by. Usually, this is the decision value of your classifier (e.g. the distance to the hyperplane for SVMs), but WEKA mostly uses the class probability. In your example, PrY and PrN sum to 1, which is good, so you can pick either one, say PrN.
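You then rank your instances by PrN in ascending order:
rank  inst#  actual  PrN
1     7      y       0
2     2      y       0.001
3     8      y       0.001
4     1      y       0.027
5     3      n       0.432
6     5      y       0.579
7     4      n       0.618
8     10     y       0.623
9     6      n       0.854
10    9      n       0.89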
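From what Wikipedia says about the Mann-Whitney U statistic, you now need to sum up, for each instance of a class, how often it is "beaten" by the other class, i.e. how many instances of the other class are ranked before it. For the positive instances (y), this would be
U_y = 0 + 0 + 0 + 0 + 1 + 2 = 3
and for the negative instances (n)
U_n = 4 + 5 + 6 + 6 = 21
So U_y = 3 and U_n = 21. Checking it:
U_y + U_n = 3 + 21 = 24 = n_y * n_n = 6 * 4
AUC_y would then be (following Wikipedia)
AUC_y = U_y / (n_y * n_n) = 3 / 24 = 0.125
and likewise AUC_n = U_n / (n_y * n_n) = 21 / 24 = 0.875.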
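The counting above can be sketched in a few lines of Python. This is only an illustration of the ranking approach, not WEKA's own implementation; the names (`u_statistic`, `pr_n`, and so on) are made up for this example, and the data are the actual and PrN columns from the question. It assumes no positive and negative instance share the same score, which holds for this data.

```python
# Actual class and PrN for instances 1..10, copied from the question.
actual = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "y"]
pr_n = [0.027, 0.001, 0.432, 0.618, 0.579, 0.854, 0.0, 0.001, 0.89, 0.623]

# Rank ascending by PrN, so the most confident positives come first.
labels = [label for _, label in sorted(zip(pr_n, actual))]


def u_statistic(labels, cls):
    """For each instance of `cls`, count the other-class instances ranked before it."""
    u = 0
    others_seen = 0
    for label in labels:
        if label == cls:
            u += others_seen
        else:
            others_seen += 1
    return u


n_y = labels.count("y")
n_n = labels.count("n")
u_y = u_statistic(labels, "y")
u_n = u_statistic(labels, "n")

auc_y = u_y / (n_y * n_n)
auc_n = u_n / (n_y * n_n)
print(u_y, u_n, auc_y, auc_n)  # 3 21 0.125 0.875
```

If positives and negatives can tie on the score, each tied pair is usually counted as one half, which this sketch does not handle.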
In this case I strongly believe that AUC_n = 0.875 is the AUC you want. We sorted by PrN in ascending order, so the positives should come first, and U_n counts how often a positive is in fact ranked ahead of a negative.
A more intuitive and graphical description of what we just did is this:
We sort our instances by their decision value / class probability. If we sort ascending by PrN, the positive ones should come first. (If we sort ascending by PrY, the negative ones should come first.) Now we draw a plot, beginning at coordinates (0,0). Every time we encounter an actual positive instance, we draw one unit up. Every time we encounter a negative instance, we draw one unit right. This line now separates two areas, which look like this in ASCII art, with "A" marking the unit squares under the line and "." the ones above it (I'll replace it with a decent image as soon as I can):
. . A A
. A A A
A A A A
A A A A
A A A A
A A A A
The separating line is the ROC curve, and the area under it is (hence the name) the AUC. The AUC here is 21 units, which we normalize by dividing by the total area of 24, yielding 21/24 = 0.875.
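The drawing procedure translates almost literally into code. Here is a rough Python sketch (variable names are my own, data again taken from the question's actual and PrN columns) that walks the ranked list and counts the unit squares under the staircase; it assumes there are no ties between positive and negative scores.

```python
# Actual class and PrN for instances 1..10, copied from the question.
actual = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "y"]
pr_n = [0.027, 0.001, 0.432, 0.618, 0.579, 0.854, 0.0, 0.001, 0.89, 0.623]

# Ascending by PrN: positives are expected first.
labels = [label for _, label in sorted(zip(pr_n, actual))]

height = 0  # units drawn upward so far (positives seen)
area = 0    # unit squares under the staircase
for label in labels:
    if label == "y":
        height += 1     # positive: one unit up
    else:
        area += height  # negative: one unit right adds a column of `height` squares

total = labels.count("y") * labels.count("n")
print(area, total, area / total)  # 21 24 0.875
```

Each step to the right adds a column as tall as the number of positives seen so far, which is exactly the per-negative count that goes into U_n.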
You can also do the whole calculation already normalized, which is equivalent to plotting it as FPR vs TPR.
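As a sketch of the normalized variant, the following Python snippet builds the (FPR, TPR) points from the same ranking and integrates them with the trapezoidal rule. The names are illustrative only and this is not taken from WEKA; it also assumes no ties between positive and negative scores.

```python
# Actual class and PrN for instances 1..10, copied from the question.
actual = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "y"]
pr_n = [0.027, 0.001, 0.432, 0.618, 0.579, 0.854, 0.0, 0.001, 0.89, 0.623]

labels = [label for _, label in sorted(zip(pr_n, actual))]
n_pos = labels.count("y")
n_neg = labels.count("n")

# Walk the ranking and record one ROC point per instance, starting at (0, 0).
fpr, tpr = [0.0], [0.0]
tp = fp = 0
for label in labels:
    if label == "y":
        tp += 1
    else:
        fp += 1
    fpr.append(fp / n_neg)
    tpr.append(tp / n_pos)

# Trapezoidal rule over the ROC points; vertical steps have zero width.
auc = sum(
    (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    for i in range(1, len(fpr))
)
print(round(auc, 3))  # 0.875
```

Because both axes are already divided by the class counts, no final division by 24 is needed.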