sklearn random forest: .oob_score_ too low?

Posted 2019-06-22 05:42

Question:

I was searching for applications for random forests, and I found the following knowledge competition on Kaggle:

https://www.kaggle.com/c/forest-cover-type-prediction.

Following the advice at

https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn,

I used sklearn to build a random forest with 500 trees.

The .oob_score_ was ~2%, but the score on the holdout set was ~75%.

There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross-validated.

Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross validated scores? I would expect them to be similar.

There's a similar question here:

https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests

Edit: I think it might be a bug, too.

The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.
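For reference, a minimal sketch of that setup might look like the following (the file name train.csv and the Id / Cover_Type column names are just the competition's usual layout and are assumptions here; adjust them if your data differs):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv( 'train.csv' )                        # assumed Kaggle training file
X     = train.drop( columns = [ 'Id', 'Cover_Type' ] )    # assumed feature columns
y     = train['Cover_Type']                               # assumed target column

aRF = RandomForestClassifier( n_estimators = 500, oob_score = True, n_jobs = -1, random_state = 0 )
aRF.fit( X, y )
print( aRF.oob_score_ )                                   # the out-of-bag accuracy estimate in question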

I didn't save the cross-validation testing I did, but I could redo it if people need to see it.

Answer 1:

Q: Can anyone explain the discrepancy ...

A: No, the sklearn.ensemble.RandomForestClassifier object and the .oob_score_ attribute value you observed are not a bug-related issue.

First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that typical approaches, incl. Cross-Validation, do not work the same way as for other AI/ML learners.

The RandomForest "inner" logic relies heavily on a RANDOM-PROCESS, by which the samples ( DataSET X ) with known y == { labels ( for Classifier ) | targets ( for Regressor ) } get split throughout the forest generation: each tree is bootstrapped by RANDOMLY splitting the DataSET into a part the tree can see and a part the tree will not see ( thus forming an inner oob-subSET ).

Besides other effects on sensitivity to overfitting et al, the RandomForest ensemble does not need to be Cross-Validated, because it does not over-fit by design. Many papers, as well as Breiman's (Berkeley) empirical work, support this statement, bringing evidence that a CV-ed predictor will have the same .oob_score_.
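A quick way to check this claim on one's own data is to compare the .oob_score_ against a cross-validated score; a minimal sketch, assuming X and y are already loaded as in the question above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

aRF = RandomForestClassifier( n_estimators = 500, oob_score = True, n_jobs = -1, random_state = 0 )
aRF.fit( X, y )

cv_scores = cross_val_score( aRF, X, y, cv = 5, n_jobs = -1 )   # 5-fold CV accuracy
print( 'OOB estimate:', aRF.oob_score_ )                        # out-of-bag accuracy estimate
print( 'CV  estimate:', cv_scores.mean() )                      # mean cross-validated accuracy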

import sklearn.ensemble
aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor( n_estimators                = 10,           # The number of trees in the forest.
                                                        criterion                   = 'mse',        # { Regressor: 'mse' | Classifier: 'gini' } ( 'mse' was renamed to 'squared_error' in newer scikit-learn releases )
                                                        max_depth                   = None,
                                                        min_samples_split           = 2,
                                                        min_samples_leaf            = 1,
                                                        min_weight_fraction_leaf    = 0.0,
                                                        max_features                = 'auto',       # ( 'auto' was removed in newer scikit-learn releases )
                                                        max_leaf_nodes              = None,
                                                        bootstrap                   = True,
                                                        oob_score                   = False,        # SET True to get an inner-CrossValidation-alike .oob_score_ attribute calculated right during the Training-phase on the whole DataSET
                                                        n_jobs                      = 1,            # { 1 | n-cores | -1 == all-cores }
                                                        random_state                = None,
                                                        verbose                     = 0,
                                                        warm_start                  = False
                                                        )
aRF_PREDICTOR.estimators_                             # a list of <DecisionTreeRegressor>  The collection of fitted sub-estimators.
aRF_PREDICTOR.feature_importances_                    # array of shape = [n_features]      The feature importances (the higher, the more important the feature).
aRF_PREDICTOR.oob_score_                              # float                              Score of the training dataset obtained using an out-of-bag estimate.
aRF_PREDICTOR.oob_prediction_                         # array of shape = [n_samples]       Prediction computed with out-of-bag estimate on the training set.

aRF_PREDICTOR.apply(         X )                      # Apply trees in the forest to X, return leaf indices.
aRF_PREDICTOR.fit(           X, y )                   # Build a forest of trees from the training set (X, y); accepts an optional sample_weight.
aRF_PREDICTOR.fit_transform( X, y )                   # Fit to data, then transform it ( removed from forests in newer scikit-learn releases; use SelectFromModel instead ).
aRF_PREDICTOR.get_params()                            # Get parameters for this estimator; accepts an optional deep flag.
aRF_PREDICTOR.predict(       X )                      # Predict regression target for X.
aRF_PREDICTOR.score(         X, y )                   # Return the coefficient of determination R^2 of the prediction; accepts an optional sample_weight.
aRF_PREDICTOR.set_params(    **params )               # Set the parameters of this estimator.
aRF_PREDICTOR.transform(     X )                      # Reduce X to its most important features ( removed from forests in newer scikit-learn releases; use SelectFromModel instead ).

One shall also be aware that the default values do not serve best, let alone serve well under all circumstances. One shall pay attention to the problem domain so as to propose a reasonable set of ensemble parametrisations before moving further.
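For illustration only, a minimal sketch of such problem-aware parametrisation via a small grid search ( the grid values below are arbitrary examples, not recommendations, and X, y are assumed to be loaded as above ):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = { 'n_estimators':     [ 100, 500 ],
               'max_features':     [ 'sqrt', 0.5 ],
               'min_samples_leaf': [ 1, 5 ],
               }
search = GridSearchCV( RandomForestClassifier( n_jobs = -1, random_state = 0 ),
                       param_grid,
                       cv = 3
                       )
search.fit( X, y )
print( search.best_params_, search.best_score_ )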


Q: What is a good .oob_score_ ?

A: .oob_score_ is RANDOM! ... Yes, it MUST ( be random ).

While this may sound like a provocative epilogue, do not throw your hopes away. The RandomForest ensemble is a great tool. Some problems may come with categorical values in the features ( DataSET X ); however, the costs of processing the ensemble are still adequate once you need to struggle with neither bias nor overfitting. That's great, isn't it?

Due to the need to reproduce the same results on subsequent re-runs, it is a recommendable practice to (re-)set numpy.random & .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS ( embedded in every bootstrapping of the RandomForest ensemble ). Doing so, one may observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive powers, introduced by more ensemble members ( n_estimators ) and less constrained tree construction ( max_depth, max_leaf_nodes et al ), and not just stochastically, by "better luck" during the RANDOM-PROCESS of how to split the DataSET...
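A minimal sketch of such state-full control, re-using the aRF_PREDICTOR from the listing above ( the seed value 0 is an arbitrary choice, and X, y are assumed as above ):

import numpy as np

np.random.seed( 0 )                                              # reset the global numpy RNG ( used when random_state is None )
aRF_PREDICTOR.set_params( random_state = 0, oob_score = True )   # pin the forest's own RNG and enable OOB scoring
aRF_PREDICTOR.fit( X, y )                                        # repeated runs now reproduce the same .oob_score_
print( aRF_PREDICTOR.oob_score_ )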

Going closer to better solutions typically involves adding more trees to the ensemble ( RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for making good decisions on highly complex DataSETs ). Numbers above 2000 are not uncommon. One may iterate over a range of sizes ( with the RANDOM-PROCESS kept under state-full control ) to demonstrate the ensemble "improvements".
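A minimal sketch of such an iteration, with the RANDOM-PROCESS pinned via random_state ( the list of sizes is an arbitrary example and X, y are assumed as above ):

from sklearn.ensemble import RandomForestClassifier

for n in ( 10, 100, 500, 2000 ):
    aRF = RandomForestClassifier( n_estimators = n, oob_score = True, n_jobs = -1, random_state = 0 )
    aRF.fit( X, y )
    print( n, aRF.oob_score_ )                               # .oob_score_ ought to improve with n, not by luck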

If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is just 1% - 3% better than a RANDOM-GUESS.

Only after you have made your ensemble-based predictor into something better should you move into additional tricks of feature engineering et al.

aRF_PREDICTOR.oob_score_    Out[79]: 0.638801  # n_estimators =   10
aRF_PREDICTOR.oob_score_    Out[89]: 0.789612  # n_estimators =  100