How to get performance metrics in SparkR classification, e.g., F1 score, precision, recall, confusion matrix
# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# Performance evaluation
I've tried caret::confusionMatrix(testing$label, testing$prediction), but it gives this error:
Error in unique.default(x, nmax = nmax) : unique() applies only to vectors
Caret's confusionMatrix will not work, since it needs R data frames, while your data are in Spark DataFrames.

One (not recommended) way of getting your metrics is to collect your Spark DataFrames locally into R using as.data.frame, and then use caret etc.; but this assumes that your data can fit in the main memory of your driver machine, in which case of course you have absolutely no reason to use Spark.

So, here is a way to get the accuracy in a distributed manner (i.e. without collecting data locally), using the iris data as an example.

(Regarding the 149 correct predictions out of 150 samples: if you run showDF(predictions, numRows = 150), you will indeed see that there is a single virginica sample misclassified as versicolor.)
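The original answer's code did not survive extraction, but the distributed approach it describes can be sketched roughly as follows. This is a sketch assuming Spark 2.x with SparkR; spark.randomForest, filter, nrow, and crosstab are all part of the SparkR API, but the exact accuracy you get will depend on the random forest's seed, so no specific numbers are claimed here:

```r
library(SparkR)
sparkR.session()

# Convert the local iris data frame to a Spark DataFrame
# (SparkR replaces the dots in the iris column names with underscores)
df <- as.DataFrame(iris)

model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)

# Accuracy, computed without collecting the data locally:
# filter() and nrow() (i.e. a count) are executed on the cluster,
# and only the final scalar comes back to the driver
correct <- nrow(filter(predictions, predictions$Species == predictions$prediction))
accuracy <- correct / nrow(predictions)
accuracy

# A confusion matrix can be obtained similarly with crosstab(), which
# aggregates on the cluster and returns only the small contingency
# table to the driver as a local R data frame
crosstab(predictions, "Species", "prediction")
```

From the confusion matrix you can then compute precision, recall, and F1 per class locally, since at that point the table is tiny regardless of how large the original data was.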