Size of sample in Random Forest Regression

Posted 2020-07-09 08:32

If I understand correctly, Random Forest estimators are usually built with bootstrapping, which means that tree(i) is built using only the data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.

The only thing that I see that is close:

bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

But there is no way to specify the size or proportion of the sample, nor does the documentation state the default sample size.

I feel like there should be a way to at least know what the default sample size is. What am I missing?

3 answers
甜甜的少女心
#2 · 2020-07-09 09:11

The sample size for bootstrap is always the number of samples.

You are not missing anything. The same question was asked on the mailing list for RandomForestClassifier:

The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.
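To illustrate what a full-size bootstrap means, here is a minimal sketch in plain NumPy. It is not scikit-learn's internal code; the variable names and the choice of 1000 rows are assumptions made for the example. It draws as many indices as there are rows, with replacement, and reports how many distinct rows ended up in the draw.

```python
import numpy as np

# Hypothetical training-set size, chosen only for illustration.
n_samples = 1000
rng = np.random.default_rng(0)

# A full-size bootstrap: draw n_samples row indices with replacement.
indices = rng.integers(0, n_samples, size=n_samples)

n_drawn = indices.size                  # equals n_samples by construction
n_unique = np.unique(indices).size      # distinct rows that were picked
unique_fraction = n_unique / n_samples  # share of rows seen at least once

print(n_drawn, unique_fraction)
```

Because the draw is with replacement, some rows appear more than once and others not at all, so each tree sees a sample of the same size as the input but with fewer distinct rows.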

爷的心禁止访问
#3 · 2020-07-09 09:26

I agree it's quite strange that we cannot specify the subsample/bootstrap size in RandomForestRegressor. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor can be seen as a special case of BaggingRegressor (both use bootstraps to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor, the base estimator is forced to be a decision tree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 will draw random subsamples with size equal to half of the entire training set. You can also use just a subset of the features by setting max_features and bootstrap_features.
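The workaround described above might look like the following sketch. The synthetic data from make_regression and the specific sizes (200 rows, 10 estimators) are assumptions made for the example. The tree is passed positionally because the keyword name for the base estimator differs between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

bagging = BaggingRegressor(
    DecisionTreeRegressor(),  # base estimator, passed positionally
    n_estimators=10,
    max_samples=0.5,          # each tree is fit on half of the training rows
    random_state=0,
)
bagging.fit(X, y)

# estimators_samples_ holds the row indices drawn for each fitted estimator.
sample_sizes = [len(idx) for idx in bagging.estimators_samples_]
print(sample_sizes)
```

Note that a decision tree used this way considers all features at every split by default, so this is bagged trees rather than an exact random forest unless the tree's own max_features is also restricted.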

Fickle 薄情
#4 · 2020-07-09 09:30

In version 0.22 of scikit-learn, the max_samples option was added, which does what you asked; see the documentation of the class.
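A minimal sketch of that option, assuming scikit-learn 0.22 or later; the synthetic data and the parameter values are chosen only for illustration. A float max_samples is interpreted as a fraction of the training rows, an int as an absolute count, and None (the default) keeps the bootstrap sample the same size as the input.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

forest = RandomForestRegressor(
    n_estimators=20,
    bootstrap=True,   # max_samples only applies when bootstrapping is on
    max_samples=0.5,  # each tree is fit on a bootstrap sample of half the rows
    random_state=0,
)
forest.fit(X, y)

predictions = forest.predict(X[:5])
print(predictions.shape)
```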
