What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will attempt to explain:
Suppose our training data set is represented by T and suppose the data set has M features (or attributes or variables).

T = {(X1, y1), (X2, y2), ..., (Xn, yn)}

where each Xi is an input vector of M features and yi is the corresponding label (or output or class).

Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods:

- Bagging
- Random subspace method
Suppose we decide to have S trees in our forest. We then create S datasets of the "same size as original", created by random resampling of the data in T with replacement (n draws for each dataset). This results in {T1, T2, ..., TS} datasets. Each of these is called a bootstrap dataset. Due to the "with replacement" sampling, every dataset Ti can have duplicate data records, and Ti can be missing several data records from the original dataset. This is called Bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)).

Bagging is the process of taking bootstraps and then aggregating the models learned on each bootstrap.
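To make the resampling concrete, here is a minimal sketch in Python/NumPy; the helper name make_bootstraps and its signature are made up for illustration, not a library API:

```python
import numpy as np

# Minimal bootstrapping sketch (illustrative helper, not a library API):
# draw S datasets of the same size as T by sampling row indices with replacement.
def make_bootstraps(X, y, S, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    bootstraps = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)          # n draws with replacement
        bootstraps.append((X[idx], y[idx], idx))  # keep idx so out-of-bag rows can be found later
    return bootstraps
```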
Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of the M possible features to create each tree. This is called the random subspace method.

So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree and produce S outputs (one for each tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
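Putting the two methods together, here is a rough sketch of the tree-building and voting steps, assuming the make_bootstraps helper above and scikit-learn's DecisionTreeClassifier as the base learner (note that Breiman's actual algorithm re-samples the m features at every split, whereas this simplification picks them once per tree):

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(bootstraps, M, m, seed=0):
    """Train one decision tree per bootstrap on m randomly chosen features."""
    rng = np.random.default_rng(seed)
    forest = []
    for Xb, yb, _ in bootstraps:
        feats = rng.choice(M, size=m, replace=False)      # the random subspace for this tree
        tree = DecisionTreeClassifier(random_state=0).fit(Xb[:, feats], yb)
        forest.append((tree, feats))
    return forest

def predict(forest, x):
    """Majority vote over the S per-tree outputs for a single input vector x."""
    votes = [tree.predict(x[feats].reshape(1, -1))[0] for tree, feats in forest]
    return Counter(votes).most_common(1)[0][0]
```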
Out-of-bag error:

After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all Tk which do not include (Xi, yi). Note that this subset is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over those trees whose Tk does not contain (Xi, yi).

The out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare its predictions with the known yi's).

Why is it important? The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
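Continuing the illustrative sketch from above, the out-of-bag estimate could be computed by hand like this (again, just a sketch of the idea described here, not how any particular library implements it):

```python
from collections import Counter

def oob_error(forest, bootstraps, X, y):
    """Each record is voted on only by the trees whose bootstrap did not contain it."""
    wrong, counted = 0, 0
    for i in range(len(X)):
        votes = []
        for (tree, feats), (_, _, idx) in zip(forest, bootstraps):
            if i not in idx:                              # record i is out-of-bag for this tree
                votes.append(tree.predict(X[i, feats].reshape(1, -1))[0])
        if votes:                                         # a few records may appear in every bootstrap
            counted += 1
            wrong += Counter(votes).most_common(1)[0][0] != y[i]
    return wrong / counted
```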
(Thanks @Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data (a bootstrap sample of size n contains on average about 1 - 1/e ≈ 63.2% of the distinct records, which is where the "about 2/3" comes from). As the forest is built, each tree can thus be tested (similar to leave-one-out cross-validation) on the samples that were not used in building that tree. This is the out-of-bag error estimate: an internal error estimate of a random forest as it is being constructed.
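In practice you rarely compute this by hand. For example, scikit-learn's RandomForestClassifier reports it when oob_score=True, so you can watch how the OOB error behaves as you grow more trees; the dataset and parameter values below are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for n_trees in (25, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    print(n_trees, 1 - rf.oob_score_)   # OOB error; it typically flattens out as trees are added
```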