Spark: regression model threshold and precision

Posted 2019-03-21 22:48

I have a logistic regression model, where I explicitly set the threshold to 0.5.

model.setThreshold(0.5)

I train the model and then I want to get basic stats: precision, recall, etc.

This is what I do when I evaluate the model:

val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold

precision.foreach { case (t, p) =>
  println(s"Threshold is: $t, Precision is: $p")
}

The results contain only 0.0 and 1.0 as threshold values; 0.5 is completely ignored.

Here is the output of the above loop:

Threshold is: 1.0, Precision is: 0.8571428571428571

Threshold is: 0.0, Precision is: 0.3005181347150259

When I call metrics.thresholds() it also returns only two values, 0.0 and 1.0.

How do I get the precision and recall values at a threshold of 0.5?

3 answers
狗以群分
#2 · 2019-03-21 23:21

You need to clear the model threshold before you make predictions. Clearing the threshold makes predict return the raw score instead of the classified label. Otherwise the only "scores" the metrics ever see are the labels 0.0 and 1.0, so you get just those two thresholds.

model.clearThreshold()

A tuple from predictionAndLabels should look like (0.6753421, 1.0), not (1.0, 1.0).

Take a look at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala

You probably still want to set numBins to control the number of points if the input is large.
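To make the above concrete, here is a minimal sketch of the whole flow, assuming the RDD-based mllib API (LogisticRegressionWithLBFGS) and a local SparkContext. The toy data, the object name ClearThresholdSketch and the choice of 100 bins are made up for illustration; they are not from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical example object; names and data are illustrative only.
object ClearThresholdSketch {

  // Trains on toy data, clears the threshold, and returns (threshold, precision) pairs.
  def precisionPoints(sc: SparkContext): Array[(Double, Double)] = {
    // Made-up, deliberately overlapping one-feature data set.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.1)),
      LabeledPoint(0.0, Vectors.dense(0.3)),
      LabeledPoint(1.0, Vectors.dense(0.5)),
      LabeledPoint(0.0, Vectors.dense(0.6)),
      LabeledPoint(1.0, Vectors.dense(0.8)),
      LabeledPoint(1.0, Vectors.dense(0.9))
    ))

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

    // Without this call, predict returns 0.0 or 1.0 and only two thresholds show up.
    model.clearThreshold()

    // (score, label) order: the raw score comes first.
    val scoreAndLabels = data.map(p => (model.predict(p.features), p.label))

    // 100 bins is an arbitrary cap on the number of curve points.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels, 100)
    metrics.precisionByThreshold().collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClearThresholdSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val points = precisionPoints(sc)
      points.foreach { case (t, p) => println(s"Threshold is: $t, Precision is: $p") }

      // The thresholds are the distinct scores, so 0.5 itself is rarely listed.
      // The smallest listed threshold >= 0.5 selects the same positives as a 0.5 cut.
      points.filter(_._1 >= 0.5).sortBy(_._1).headOption.foreach { case (t, p) =>
        println(s"Precision for a 0.5 cut (read at threshold $t): $p")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Two caveats on the last step: it is exact only when no down-sampling happened (with numBins in effect it is approximate), and the metrics count a score equal to the threshold as positive, whereas the model's own thresholding uses a strict comparison, so the two can differ for a score of exactly 0.5.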

趁早两清
#3 · 2019-03-21 23:35

First, try adding more bins like this (here numBins is 10):

val metrics = new BinaryClassificationMetrics(probabilitiesAndLabels, 10)

If you still only have the two thresholds 0 and 1, check how you have defined your predictionAndLabels. You may be having this problem if you have accidentally provided (label, prediction) instead of (prediction, label).
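A quick way to spot a swapped tuple is to look at the first elements of a small sample. The sketch below is plain Scala with no Spark dependency; TupleOrderCheck and firstElementLooksBinary are hypothetical names, and the sample values are invented. On a real RDD you could presumably run the same check on the result of take(n).

```scala
// Hypothetical helper for sanity-checking tuple order; not part of Spark.
object TupleOrderCheck {

  // True when every first element is exactly 0.0 or 1.0, which suggests
  // a label or a hard prediction rather than a probability.
  def firstElementLooksBinary(pairs: Seq[(Double, Double)]): Boolean =
    pairs.forall { case (first, _) => first == 0.0 || first == 1.0 }

  def main(args: Array[String]): Unit = {
    // Invented sample in the expected (score, label) order.
    val scoreAndLabel = Seq((0.83, 1.0), (0.35, 0.0), (0.61, 1.0))
    // The same sample in the wrong (label, score) order.
    val labelAndScore = scoreAndLabel.map(_.swap)

    println(firstElementLooksBinary(scoreAndLabel)) // false
    println(firstElementLooksBinary(labelAndScore)) // true
  }
}
```

A true result does not tell you which mistake you made: it shows up both for swapped tuples and for predictions made while the model threshold was still set.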

唯我独甜
#4 · 2019-03-21 23:39

I think what happens is that all the predictions are 0.0 or 1.0. Then the intermediate threshold values make no difference.

Consider the numBins argument of BinaryClassificationMetrics:

numBins: if greater than 0, then the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". If 0, no down-sampling will occur. This is useful because the curve contains a point for each distinct score in the input, and this could be as large as the input itself -- millions of points or more, when thousands may be entirely sufficient to summarize the curve. After down-sampling, the curves will instead be made of approximately numBins points instead. Points are made from bins of equal numbers of consecutive points. The size of each bin is floor(scoreAndLabels.count() / numBins), which means the resulting number of bins may not exactly equal numBins. The last bin in each partition may be smaller as a result, meaning there may be an extra sample at partition boundaries.

So if you don't set numBins, then precision will be calculated at all the different prediction values. In your case this seems to be just 0.0 and 1.0.
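Here is a small plain-Scala sketch of that behaviour, with no Spark involved. It mimics the "one point per distinct score" rule with a point counted as positive when its score is >= the threshold; it is an illustration of the idea, not Spark's implementation, and the data and names are invented.

```scala
// Illustrative re-implementation of "precision per distinct score"; not Spark code.
object PrecisionByThresholdSketch {

  // Precision when every point with score >= cut is predicted positive.
  // Returns NaN if no point reaches the cut.
  def precisionAt(scoreAndLabels: Seq[(Double, Double)], cut: Double): Double = {
    val predictedPositive = scoreAndLabels.filter(_._1 >= cut)
    val truePositive = predictedPositive.count(_._2 == 1.0)
    truePositive.toDouble / predictedPositive.size
  }

  // One (threshold, precision) pair per distinct score, highest threshold first.
  def precisionByThreshold(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Double)] =
    scoreAndLabels.map(_._1).distinct.sorted.reverse
      .map(t => (t, precisionAt(scoreAndLabels, t)))

  def main(args: Array[String]): Unit = {
    // Invented (score, label) pairs, as produced with the threshold cleared.
    val raw = Seq((0.9, 1.0), (0.7, 1.0), (0.6, 0.0), (0.4, 1.0), (0.2, 0.0))
    // The same points after the model has applied a 0.5 threshold.
    val hard = raw.map { case (s, l) => (if (s > 0.5) 1.0 else 0.0, l) }

    println(precisionByThreshold(hard).map(_._1)) // List(1.0, 0.0)
    println(precisionByThreshold(raw).map(_._1))  // List(0.9, 0.7, 0.6, 0.4, 0.2)

    // The "threshold 1.0" row of the hard predictions is the precision of the 0.5 cut.
    println(precisionByThreshold(hard).head._2 == precisionAt(raw, 0.5)) // true
  }
}
```

Read this way, the "Threshold is: 1.0" line in the question's output is already the precision of the model at its 0.5 threshold, and the "Threshold is: 0.0" line is just the share of positive labels in the data.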
