I was searching for applications for random forests, and I found the following knowledge competition on Kaggle:
https://www.kaggle.com/c/forest-cover-type-prediction.
Following the advice I found, I used sklearn to build a random forest with 500 trees.
The .oob_score_ was ~2%, but the score on the holdout set was ~75%. There are only seven classes to classify, so 2% is really low, and I also consistently got scores near 75% when I cross-validated.
Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross-validated scores? I would expect them to be similar.
There's a similar question here:
https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests
Edit: I think it might be a bug, too.
The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.
I didn't save the cross-validation testing I did, but I could redo it if people need to see it.
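For concreteness, the setup was roughly along the lines of the sketch below. This is a minimal reconstruction, not the exact code from the second link; the column names Id and Cover_Type and the 70/30 split are placeholders of mine.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Load the Kaggle training data (column names are placeholders for this sketch).
train = pd.read_csv("train.csv")
X = train.drop(columns=["Id", "Cover_Type"])
y = train["Cover_Type"]

# Hold out part of the labelled data to get the holdout score.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 500 trees, with oob_score=True so that .oob_score_ is computed during fit().
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("OOB score:      ", rf.oob_score_)
print("Holdout score:  ", rf.score(X_te, y_te))
print("5-fold CV score:", cross_val_score(rf, X, y, cv=5).mean())
```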
Q: Can anyone explain the discrepancy ...
A: The sklearn.ensemble.RandomForestClassifier object and its observed .oob_score_ attribute value are not a bug-related issue.

First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be informed that typical approaches, incl. cross-validation, do not work the same way as for other AI/ML learners.

The RandomForest "inner" logic relies heavily on a RANDOM-PROCESS, by which the samples ( DataSET X ) with known y == { labels ( for a Classifier ) | targets ( for a Regressor ) } get split throughout the forest generation: each tree gets bootstrapped by RANDOMLY splitting the DataSET into a part that the tree can see and a part that the tree will not see ( thus forming an inner OOB subset for that tree ).

Besides other effects on sensitivity to overfitting et al., the RandomForest ensemble does not need to be cross-validated, because it does not over-fit by design. Many papers, as well as Breiman's (Berkeley) empirical proofs, have supported this statement, as they brought evidence that a CV-ed predictor will have the same .oob_score_.
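The bootstrap mechanism behind that inner OOB subset can be illustrated in a few lines of plain numpy. This is a sketch of the principle only, not sklearn's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap draw per tree: sample n_samples row indices WITH replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn form that tree's out-of-bag (OOB) subset.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap_idx] = False

print("OOB fraction for this tree: %.3f" % oob_mask.mean())
# ~0.368 on average ( = 1/e ): each tree never sees roughly a third of the rows,
# and .oob_score_ aggregates, per row, the votes of only those trees
# for which that row was out-of-bag.
```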
One shall also be informed that the default values do not serve best, much less serve well under all circumstances. One shall pay attention to the problem domain so as to propose a reasonable ensemble parametrisation before moving further.

Q: What is a good .oob_score_ ?
A: .oob_score_ is RANDOM! ... Yes, it MUST ( be random ).
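One way to see this for yourself: keep the data and all parameters fixed and vary only the random_state, and .oob_score_ will move from run to run. The sketch below uses a synthetic stand-in data set from make_classification, since the Kaggle DataSET is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data ( an assumption ) -- not the Kaggle DataSET.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=7, n_clusters_per_class=1, random_state=42)

for seed in range(5):
    rf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=seed)
    rf.fit(X, y)
    print("random_state=%d  ->  oob_score_ = %.4f" % (seed, rf.oob_score_))

# Identical data, identical parameters -- only the seed of the RANDOM-PROCESS
# differs, yet .oob_score_ changes from run to run.
```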
While this sounds like a provocative epilogue, do not throw your hopes away. The RandomForest ensemble is a great tool. Some problems may come with categoric values in the features ( DataSET X ); however, the costs of processing the ensemble remain adequate, as you need not struggle with either bias or overfitting. That's great, isn't it?

Due to the need to reproduce the same results on subsequent re-runs, it is a recommendable practice to (re-)set numpy.random and .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS ( embedded into every bootstrapping of the RandomForest ensemble ). Doing that, one may observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is truly due to improved predictive powers introduced by more ensemble members ( n_estimators ) and less constrained tree construction ( max_depth, max_leaf_nodes et al. ), and not just stochastically due to "better luck" during the RANDOM-PROCESS of how to split the DataSET.

Getting closer to better solutions typically involves more trees in the ensemble ( RandomForest decisions are based on a majority vote, so 10 estimators are not a big basis for making good decisions on highly complex DataSETs ). Numbers above 2000 are not uncommon. One may iterate over a range of sizings ( with the RANDOM-PROCESS kept under state-full control ) to demonstrate the ensemble "improvements", as in the sketch below.
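Such a controlled sweep over ensemble sizings, with the RANDOM-PROCESS pinned by a fixed random_state, might look like this; the synthetic DataSET is again a stand-in assumption, so swap in the real X, y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in DataSET ( an assumption ) -- replace with the real X, y.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=7, n_clusters_per_class=1, random_state=42)

np.random.seed(0)  # reset numpy.random to a known state, as recommended above
rf = RandomForestClassifier(oob_score=True, random_state=0)  # fixed seed = de-noised re-runs

for n_trees in (10, 50, 200, 500, 1000, 2000):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    # NB: very small ensembles may warn that some rows never received an OOB vote.
    print("n_estimators=%5d  ->  oob_score_ = %.4f" % (n_trees, rf.oob_score_))

# With random_state pinned, any improvement across the sweep reflects the larger
# ( and, if you relax max_depth / max_leaf_nodes, less constrained ) ensemble,
# not just "better luck" in how the DataSET was split.
```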
If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is only 1% - 3% better than a RANDOM-GUESS.

Only after you have made your ensemble-based predictor into something better should you move on to additional tricks on feature engineering et al.