- Spark version: 1.6.0
I am trying to compute the AUC (area under ROC) grouped by the field id. Given the following data:
# Within each key-value pair:
# - key is "id"
# - value is a list of (score, label)
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])
The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs.

I want to compute the AUC by key (i.e. for id1, id2 separately). But how do I "map" a class to an RDD by key?
Update
I tried wrapping BinaryClassificationMetrics in a function:

def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
and then mapping the wrapper function over each value:

data.groupByKey()\
    .mapValues(auc)
But the list of (score, label) pairs is in fact of type ResultIterable inside mapValues(), while BinaryClassificationMetrics expects an RDD.
Is there any way to convert the ResultIterable to an RDD so that the auc function can be applied? Or is there any other workaround for computing the group-by AUC (without importing third-party modules like scikit-learn)?