I am using Apache Spark in batch mode. I have set up an entire pipeline that transforms text into TF-IDF vectors and then predicts a boolean class using logistic regression:
from pyspark.ml import Pipeline

# Chain previously created feature transformers, indexers and regression in a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf,
                            labelIndexer, featureIndexer, lr])
# Fit the full model to the training data
model = pipeline.fit(trainingData)
# Predict test data
predictions = model.transform(testData)
I can examine predictions, which is a Spark DataFrame, and it is what I expect it to be.
Next, I want to see a confusion matrix, so I select the predictions and labels and convert them to an RDD:
predictionAndLabels = predictions.select('prediction','label').rdd
Finally, I pass that RDD to BinaryClassificationMetrics:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
metrics = BinaryClassificationMetrics(predictionAndLabels)  # this errors out
Here's the error:
AttributeError: 'SparkSession' object has no attribute 'serializer'
This error message is not helpful, and searching for it turns up a broad range of unrelated issues. The only thing I've found that seems similar is this post, which has no answers: How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?
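To make the goal concrete: the confusion matrix here is just the count of each (label, prediction) combination. Below is a minimal plain-Python sketch of that computation, assuming the pairs are available locally as (prediction, label) tuples with 1 as the positive class. The function name `confusion_counts` and the sample pairs are illustrative only; they are not part of the pipeline above and not output from it.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count confusion-matrix cells from (prediction, label) pairs.

    Assumes both values are 0/1 (or 0.0/1.0), with 1 as the positive class.
    """
    # Key each pair as (label, prediction) so the cells read naturally.
    counts = Counter((int(label), int(pred)) for pred, label in pairs)
    return {
        "tn": counts[(0, 0)],  # label 0, predicted 0
        "fp": counts[(0, 1)],  # label 0, predicted 1
        "fn": counts[(1, 0)],  # label 1, predicted 0
        "tp": counts[(1, 1)],  # label 1, predicted 1
    }


# Hypothetical sample data, in (prediction, label) order.
sample_pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
result = confusion_counts(sample_pairs)
print(result)
```

For a small test set, the pairs could presumably be obtained with predictions.select('prediction','label').collect(), since each returned Row unpacks like a tuple; that is an assumption about how this sketch would be wired in, not something verified against this pipeline.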
Any assistance is appreciated!