I am trying to find Accuracy using 5-fold cross validation using Random Forest Classifier Model in SCALA. But i am getting the following error while running:
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
Getting the above error at line---> val cvModel = cv.fit(trainingData)
The code which i used for cross validation of data set using random forest is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(41).toDouble,
Vectors.dense(parts(0).split(',').map(_.toDouble)))
}
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val trainingData = training.toDF()
val testData = test.toDF()
val nFolds: Int = 5
val NumTrees: Int = 5
val rf = new
RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(NumTrees)
val pipeline = new Pipeline()
.setStages(Array(rf))
val paramGrid = new ParamGridBuilder()
.build()
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("precision")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val cvModel = cv.fit(trainingData)
val results = cvModel.transform(testData)
.select("label","prediction").collect
val numCorrectPredictions = results.map(row =>
if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size
println("Test set accuracy: %.3f".format(accuracy))
Can any one please explain what is the mistake in the above code.
RandomForestClassifier
, same as many other ML algorithms, require specific metadata to be set on the label column and labels values to be integral values from [0, 1, 2 ..., #classes) represented as doubles. Typically this is handled by an upstreamTransformers
likeStringIndexer
. Since you convert labels manually metadata fields are not set and classifier cannot confirm that these requirements are satisfied.You can either re-encode label column using
StringIndexer
:or set required metadata manually:
Note:
Labels created using
StringIndexer
depend on the frequency not value:PySpark:
In Python metadata fields can be set directly on the schema: