If I understand correctly, Random Forest estimators are usually built with bootstrapping, which means that tree(i) is built using only data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.
The only thing that I see that is close:
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
But there is no way to specify the size or proportion of the sample, nor does it tell me the default sample size.
I feel like there should be a way to at least know what the default sample size is. What am I missing?
The sample size for bootstrap is always the number of samples.
You are not missing anything; the same question was asked on the mailing list for `RandomForestClassifier`.
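To make "the sample size is always the number of samples" concrete, here is a minimal NumPy sketch of drawing one bootstrap sample. It only illustrates the idea and is not scikit-learn's internal code; the names `n_samples` and `bootstrap_idx` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # pretend the training set has this many rows

# One bootstrap sample: n_samples row indices drawn WITH replacement,
# so the sample is as large as the training set but contains duplicates.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct training rows that ended up in the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(len(bootstrap_idx), round(unique_fraction, 3))
```

Because rows are drawn with replacement, each tree sees on average only about 1 - 1/e (roughly 63%) of the distinct rows, even though its sample has the same length as the training set.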
Uhh, I agree with you that it's quite strange that we cannot specify the subsample/bootstrap size in the `RandomForestRegressor` algorithm. A potential workaround is to use `BaggingRegressor` instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor
`RandomForestRegressor` is just a special case of `BaggingRegressor` (both use bootstraps to reduce the variance of a set of low-bias, high-variance estimators). In `RandomForestRegressor`, the base estimator is forced to be a decision tree, whereas in `BaggingRegressor` you have the freedom to choose the `base_estimator`. More importantly, you can set a custom subsample size; for example, `max_samples=0.5` will draw random subsamples with size equal to half of the entire training set. You can also use just a subset of the features by setting `max_features` and `bootstrap_features`.
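The `BaggingRegressor` workaround could look roughly like the sketch below. The synthetic data from `make_regression` and the specific values (20 estimators, 75% of the features) are assumptions chosen for illustration. The tree is passed positionally so the example does not depend on the keyword name of the base estimator argument.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

bag = BaggingRegressor(
    DecisionTreeRegressor(),   # base estimator: a plain decision tree
    n_estimators=20,
    max_samples=0.5,           # each tree is fit on half of the training rows
    max_features=0.75,         # each tree is fit on 75% of the columns
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # columns are drawn without replacement
    random_state=0,
)
bag.fit(X, y)

# Number of rows and columns drawn for each fitted tree.
sample_sizes = [len(idx) for idx in bag.estimators_samples_]
feature_counts = [len(cols) for cols in bag.estimators_features_]
print(set(sample_sizes), set(feature_counts))
```

One difference to keep in mind: here `max_features` picks a column subset once per tree, whereas a random forest's `max_features` is applied at every split, so this setup is close to a random forest but not identical to it.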
In version 0.22 of scikit-learn, the `max_samples` option was added to the random forest estimators, which does what you asked; see the documentation of the class.
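A minimal sketch of that option, assuming scikit-learn 0.22 or later; the synthetic data and the number of trees are placeholders for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,   # max_samples only applies when bootstrapping is on
    max_samples=0.5,  # draw half of the training rows for each tree
    random_state=0,
)
forest.fit(X, y)
predictions = forest.predict(X[:5])
print(predictions.shape)
```

Leaving `max_samples=None` (the default) keeps the behaviour described in the first answer: each tree's bootstrap sample has as many rows as the training set. An integer can be passed instead of a float to request an absolute number of rows.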