I would like to understand the meaning of the value (result) returned by the h2o.predict() function from the H2O R package. I realized that in some cases when the predict column is 1, the p1 column has a lower value than the p0 column. My interpretation is that the p0 and p1 columns refer to the probabilities of each event, so I expected that when predict=1 the probability p1 would be higher than the probability of the opposite event (p0). That is not always the case, as the following example using the prostate dataset shows.
Here is an executable example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)
prostate.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)
prostate.hex$RACE <- as.factor(prostate.hex$RACE)
prostate.hex$DCAPS <- as.factor(prostate.hex$DCAPS)
prostate.hex$DPROS <- as.factor(prostate.hex$DPROS)
prostate.hex.split = h2o.splitFrame(data = prostate.hex,
ratios = c(0.70, 0.20, 0.10), seed = 1234)
train.hex <- prostate.hex.split[[1]]
validate.hex <- prostate.hex.split[[2]]
test.hex <- prostate.hex.split[[3]]
fit <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = train.hex,
validation_frame = validate.hex,
family = "binomial", nfolds = 0, alpha = 0.5)
prostate.predict = h2o.predict(object = fit, newdata = test.hex)
result <- as.data.frame(prostate.predict)
subset(result, predict == 1 & p1 < 0.4)
I get the following output from the subset() call:
predict p0 p1
11 1 0.6355974 0.3644026
17 1 0.6153021 0.3846979
23 1 0.6289063 0.3710937
25 1 0.6007919 0.3992081
31 1 0.6239587 0.3760413
For all the above observations from the test.hex dataset the prediction is 1, but p0 > p1.
The total number of observations where predict=1 but p1 < p0 is:
> nrow(subset(result, predict == 1 & p1 < p0))
[1] 14
In contrast, there are no observations with predict=0 where p0 < p1:
> nrow(subset(result, predict == 0 & p0 < p1))
[1] 0
Here is the table() output for the predict column:
> table(result$predict)
0 1
18 23
We are using CAPSULE as the decision variable, with the following values:
> levels(as.data.frame(prostate.hex)$CAPSULE)
[1] "0" "1"
Any suggestion?
Note: The question with a similar topic: How to interpret results of h2o.predict does not address this specific issue.
What you are describing is a threshold of 0.5. In fact a different threshold will be used, one that maximizes a certain metric. The default metric is F1 (*); if you print the model information you can find the thresholds used for each metric.
See the question: How to understand the metrics of H2OModelMetrics Object through h2o.performance? for more on this (your question was different, which was why I didn't mark it as a duplicate).
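To see which thresholds the model recorded, something like the following may help. It is a minimal sketch that assumes the session from the question (a running H2O cluster and the fitted model fit, trained with a validation frame). Whether the training or the validation metrics supply the threshold that h2o.predict() applies is not confirmed here, so compare the printed values against your own predictions.

```r
library(h2o)

# Assumes the session from the question: a running H2O cluster and the
# fitted GLM `fit` (trained with a validation frame).

# Printing a binomial model should include a "Maximum Metrics" table that
# lists, for each metric, the threshold at which it peaks.
print(fit)

# Metrics computed on the validation frame.
perf_valid <- h2o.performance(fit, valid = TRUE)

# Threshold that maximizes F1 and, for comparison, the one that maximizes F2.
h2o.find_threshold_by_max_metric(perf_valid, "f1")
h2o.find_threshold_by_max_metric(perf_valid, "f2")
```

If the max-F1 value printed here is below 0.5, rows with p1 between that value and 0.5 will be labelled 1 even though p0 > p1, which is the pattern shown in the question.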
As far as I know you cannot change the F1 default for either h2o.predict() or h2o.performance(). Instead, you can use h2o.confusionMatrix(): given your model fit, you can ask it to use the max F2 threshold instead.
You can also just use the h2o.predict() "p0" column directly, with your own threshold, instead of the "predict" column. (That is what I have done before.)
*: The definition is here: https://github.com/h2oai/h2o-3/blob/fdde85e41bad5f31b6b841b300ce23cfb2d8c0b0/h2o-core/src/main/java/hex/AUC2.java#L34 Further down, that file also shows how each of the metrics is calculated.
It seems (also see here) that the threshold that maximizes the F1 score on the validation dataset is used as the default threshold for classification with h2o.glm(). We can observe the following:
The threshold that maximizes the F1 score on the validation dataset is 0.363477.
All datapoints with a predicted p1 probability less than this threshold are classified as class 0 (the datapoint predicted as class 0 with the highest p1 probability has p1 = 0.3602365 < 0.363477).
All datapoints with a predicted p1 probability greater than this threshold are classified as class 1 (the datapoint predicted as class 1 with the lowest p1 probability has p1 = 0.3644026 > 0.363477).
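The rule described above can be checked without a cluster. The following is a minimal base R sketch that uses only the two boundary p1 values and the 0.363477 threshold quoted in this answer. classify_with_threshold is a hypothetical helper, not part of h2o, and treating a value exactly equal to the threshold as class 1 is an assumption.

```r
# Threshold and boundary probabilities quoted in the answer above.
f1_threshold <- 0.363477
p1 <- c(0.3602365, 0.3644026)

# Hypothetical helper (not an h2o function): label 1 when p1 reaches the cutoff.
# Treating p1 == threshold as class 1 is an assumption.
classify_with_threshold <- function(p1, threshold) {
  as.integer(p1 >= threshold)
}

# With the max-F1 threshold the two points fall on opposite sides.
at_f1 <- classify_with_threshold(p1, f1_threshold)

# With a plain 0.5 cutoff both points would be labelled 0, which is the
# behaviour the question expected.
at_half <- classify_with_threshold(p1, 0.5)

print(data.frame(p1 = p1, at_f1 = at_f1, at_half = at_half))
```

The same comparison can be applied to the p1 column of as.data.frame(h2o.predict(...)) to get labels under whatever cutoff you choose.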