What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
问题:
回答1:
I will take an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
summary of RF:
Random Forests algorithm is a classifier based on primarily two methods -
- Bagging
- Random subspace method.
Suppose we decide to have S
number of trees in our forest then we first create S
datasets of "same size as original"
created from random resampling of data in T with-replacement (n times for each dataset). This will result in {T1, T2, ... TS}
datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti
can have duplicate data records and Ti can be missing several data records from original datasets. This is called Bootstrapping
. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
Bagging is the process of taking bootstraps & then aggregating the models learned on each bootstrap.
Now, RF creates S
trees and uses m (=sqrt(M) or =floor(lnM+1))
random subfeatures out of M
possible features to create any tree. This is called random subspace method.
So for each Ti
bootstrap dataset you create a tree Ki
. If you want to classify some input data D = {x1, x2, ..., xM}
you let it pass through each tree and produce S
outputs (one for each tree) which can be denoted by Y = {y1, y2, ..., ys}
. Final prediction is a majority vote on this set.
Out-of-bag error:
After creating the classifiers (S
trees), for each (Xi,yi)
in the original training set i.e. T
, select all Tk
which does not include (Xi,yi)
. This subset, pay attention, is a set of boostrap datasets which does not contain a particular record from the original dataset. This set is called out-of-bag examples. There are n
such subsets (one for each data record in original dataset T). OOB classifier is the aggregation of votes ONLY over Tk
such that it does not contain (xi,yi)
.
Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare it with known yi
's).
Why is it important? The study of error estimates for bagged classifiers in Breiman [1996b], gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set aside test set.
(Thanks @Rudolf for corrections. His comments below.)
回答2:
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data. As the forest is built, each tree can thus be tested (similar to leave one out cross validation) on the samples not used in building that tree. This is the out of bag error estimate - an internal error estimate of a random forest as it is being constructed.