I am using the multiclass.roc function from the pROC package in R. For instance, I trained a random forest on a data set; here is my code:
# randomForest & pROC packages should be installed:
# install.packages(c('randomForest', 'pROC'))
data(iris)
library(randomForest)
library(pROC)
set.seed(1000)
# 3-class in response variable
rf = randomForest(Species~., data = iris, ntree = 100)
# predict(.., type = 'prob') returns a probability matrix
multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))
And the result is:
Call:
multiclass.roc.default(response = iris$Species, predictor = predict(rf,
iris, type = "prob"))
Data: predict(rf, iris, type = "prob") with 3 levels of iris$Species: setosa,
versicolor, virginica.
Multi-class area under the curve: 0.5142
Is this right? Thanks!!!
"pROC" reference: http://www.inside-r.org/packages/cran/pROC/docs/multiclass.roc
As you saw in the reference, multiclass.roc expects a "numeric vector (...)", and the documentation of roc that is linked from there (for some reason not in the link you provided) further says "of the same length than response". You are passing a numeric matrix with 3 columns, which is clearly wrong, and isn't supported any more since pROC 1.6. I have no idea what it was doing before, probably not what you were expecting.
This means you must summarize your predictions in one single atomic vector of numeric mode. In the case of your model, you could use the following, although it generally doesn't really make sense to convert a factor into a numeric:
What this code really does is compute 3 ROC curves on your predictions (one with setosa vs. versicolor, one with versicolor vs. virginica, and one with setosa vs. virginica) and average their AUCs.
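A minimal sketch of the conversion and the pairwise averaging described above, assuming the rf model and packages from the question. The helper pairwise_auc is hypothetical (it is not part of pROC) and uses the rank-sum form of the AUC; unlike pROC, it does not choose a direction for each curve, so its mean may differ from the value multiclass.roc reports.

```r
library(randomForest)
library(pROC)

data(iris)
set.seed(1000)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# type = "response" returns the predicted class as a factor;
# as.numeric() maps its levels to the codes 1, 2, 3
predictions <- as.numeric(predict(rf, iris, type = "response"))

# One numeric value per observation, as multiclass.roc expects
result <- multiclass.roc(iris$Species, predictions)
result$auc

# Hypothetical helper: two-class AUC from the rank-sum (Mann-Whitney) statistic
pairwise_auc <- function(score, is_positive) {
  r <- rank(score)                       # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One AUC per pair of species, computed on the observations of that pair only
pairs <- combn(levels(iris$Species), 2)
aucs <- apply(pairs, 2, function(p) {
  keep <- iris$Species %in% p
  pairwise_auc(predictions[keep], iris$Species[keep] == p[2])
})

# Average of the three pairwise AUCs
mean(aucs)
```

Because the predictions here are made on the training data itself, the result is a resubstitution estimate and will be optimistic.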
Three more comments:
Assuming that you did the resubstitution estimate only for the sake of the minimal working example, your code looks good to me.
I quickly tried to get an out-of-bag (OOB) prediction with type "prob" but didn't succeed. Thus, you'll need to do a validation external to the randomForest function.
Personally, I'd not try to summarize a whole multiclass model into one unconditional number. But that's an entirely different question.
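One simple form of the external validation mentioned in the second comment is a hold-out split. The sketch below assumes an arbitrary split of 100 training and 50 test rows; the names test_idx, train, test, and prob_test are illustrative and do not come from the question.

```r
library(randomForest)

data(iris)
set.seed(1000)

# Hold out 50 randomly chosen rows that the forest never sees during training
test_idx <- sample(nrow(iris), 50)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

rf <- randomForest(Species ~ ., data = train, ntree = 100)

# Class probabilities (vote fractions) for the held-out rows only
prob_test <- predict(rf, test, type = "prob")
head(prob_test)
```

Any performance measure computed from prob_test, or from the predicted classes on test, is then based on data that was not used to fit the model.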
I copied your code and got an AUC of .83. Not sure what is different.
You are right, the s100b column is not a probability. The aSAH (aneurysmal subarachnoid hemorrhage) data set is a clinical data set. s100b is a protein found in glial cells in the brain. From the research paper on the dataset, the s100b column seems to represent the concentration of the s100b protein (µg/l), likely in a blood sample.