How to get performance metrics in SparkR classification, e.g., F1 score, precision, recall, confusion matrix
# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# Performance evaluation
I've tried caret::confusionMatrix(testing$label, testing$prediction), but it gives this error:
Error in unique.default(x, nmax = nmax) : unique() applies only to vectors
Caret's confusionMatrix will not work, since it needs R data frames, while your data are in Spark DataFrames.

One (not recommended) way of getting your metrics is to collect your Spark DataFrames locally into R using as.data.frame, and then use caret etc.; but this assumes that your data can fit in the main memory of your driver machine, in which case of course you have absolutely no reason to use Spark.

So, here is a way to get the accuracy in a distributed manner (i.e. without collecting data locally), using the iris data as an example.

(Regarding the 149 correct predictions out of 150 samples: if you run showDF(predictions, numRows = 150), you will indeed see that there is a single virginica sample misclassified as versicolor.)
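The original answer's code did not survive extraction, but the distributed approach it describes can be sketched roughly as follows. This is a sketch assuming Spark 2.x with SparkR; spark.randomForest, filter, nrow, and crosstab are all part of the SparkR API, but the exact accuracy you get will depend on the random forest's seed, so no specific numbers are claimed here:

```r
library(SparkR)
sparkR.session()

# Convert the local iris data frame to a Spark DataFrame
# (SparkR replaces the dots in the iris column names with underscores)
df <- as.DataFrame(iris)

model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)

# Accuracy, computed without collecting the data locally:
# filter() and nrow() (i.e. a count) are executed on the cluster,
# and only the final scalar comes back to the driver
correct <- nrow(filter(predictions, predictions$Species == predictions$prediction))
accuracy <- correct / nrow(predictions)
accuracy

# A confusion matrix can be obtained similarly with crosstab(), which
# aggregates on the cluster and returns only the small contingency
# table to the driver as a local R data frame
crosstab(predictions, "Species", "prediction")
```

From the confusion matrix you can then compute precision, recall, and F1 per class locally, since at that point the table is tiny regardless of how large the original data was.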