I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel), which can then be used for model evaluation. What is the best way to achieve this? Thanks.
Assuming parsedTrainData is an RDD of LabeledPoint:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0, -12.0, 23.0]),
                                  LabeledPoint(3.0, [-1.0, 12.0, -23.0])])
model = DecisionTree.trainClassifier(parsedTrainData, numClasses=7,
                                     categoricalFeaturesInfo={}, impurity='gini',
                                     maxDepth=8, maxBins=32)
model.predict(parsedTrainData.map(lambda x: x.features)).take(1)
This gives back the predictions, but I am not sure how to match each prediction back to its original label in the data.
I tried
parsedTrainData.map(lambda x: (x.label, model.predict(x.features))).take(1)
However, it seems that referencing the model inside a transformation sent to the workers is not a valid thing to do here:
/spark140/python/pyspark/context.pyc in __getnewargs__(self)
    250         # This method is called when attempting to pickle SparkContext, which is always an error:
    251         raise Exception(
--> 252             "It appears that you are attempting to reference SparkContext from a broadcast "
    253             "variable, action, or transformation. SparkContext can only be used on the driver, "
    254             "not in code that it run on workers. For more information, see SPARK-5063."
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Well, according to the official documentation, you can simply zip the predictions with the labels, like this:
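A minimal sketch of that pattern, mirroring the decision-tree example in the MLlib guide and reusing parsedTrainData and model from the snippet above:
# Call predict once on the driver over the whole features RDD; the resulting
# RDD keeps the same partitioning, so it can be zipped with the labels.
predictions = model.predict(parsedTrainData.map(lambda x: x.features))
labelsAndPredictions = parsedTrainData.map(lambda x: x.label).zip(predictions)

# (originalLabelInData, predictedLabel) tuples, usable for evaluation, e.g.:
trainErr = (labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count()
            / float(parsedTrainData.count()))
This avoids the SPARK-5063 error because the model is only ever used on the driver; no closure shipped to the workers references it.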