What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will attempt to explain:
Suppose our training data set is represented by T and suppose the data set has M features (or attributes or variables).

T = {(X1, y1), (X2, y2), ..., (Xn, yn)}

where each Xi is an input vector of M features and yi is the corresponding label (or output or class).

Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods:

- Bagging
- Random subspace method
Suppose we decide to have S trees in our forest. We then create S datasets of the "same size as original", created by random resampling of the data in T with replacement (n draws for each dataset). This results in {T1, T2, ..., TS} datasets. Each of these is called a bootstrap dataset. Due to the "with replacement" sampling, every dataset Ti can have duplicate data records, and Ti can be missing several data records from the original dataset. This is called Bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)).

Bagging is the process of taking bootstraps and then aggregating the models learned on each bootstrap.
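To make the resampling concrete, here is a minimal sketch in Python/NumPy; the helper name make_bootstraps and its signature are made up for illustration, not a library API:

```python
import numpy as np

# Minimal bootstrapping sketch (illustrative helper, not a library API):
# draw S datasets of the same size as T by sampling row indices with replacement.
def make_bootstraps(X, y, S, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    bootstraps = []
    for _ in range(S):
        idx = rng.integers(0, n, size=n)          # n draws with replacement
        bootstraps.append((X[idx], y[idx], idx))  # keep idx so out-of-bag rows can be found later
    return bootstraps
```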
Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of the M possible features to create each tree. This is called the random subspace method.

So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree and produce S outputs (one for each tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
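Putting the two methods together, here is a rough sketch of the tree-building and voting steps, assuming the make_bootstraps helper above and scikit-learn's DecisionTreeClassifier as the base learner (note that Breiman's actual algorithm re-samples the m features at every split, whereas this simplification picks them once per tree):

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(bootstraps, M, m, seed=0):
    """Train one decision tree per bootstrap on m randomly chosen features."""
    rng = np.random.default_rng(seed)
    forest = []
    for Xb, yb, _ in bootstraps:
        feats = rng.choice(M, size=m, replace=False)      # the random subspace for this tree
        tree = DecisionTreeClassifier(random_state=0).fit(Xb[:, feats], yb)
        forest.append((tree, feats))
    return forest

def predict(forest, x):
    """Majority vote over the S per-tree outputs for a single input vector x."""
    votes = [tree.predict(x[feats].reshape(1, -1))[0] for tree, feats in forest]
    return Counter(votes).most_common(1)[0][0]
```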
Out-of-bag error:

After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all Tk which do not include (Xi, yi). Note that this subset is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over those trees whose Tk does not contain (Xi, yi).

The out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare its predictions with the known yi's).

Why is it important? The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
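Continuing the illustrative sketch from above, the out-of-bag estimate could be computed by hand like this (again, just a sketch of the idea described here, not how any particular library implements it):

```python
from collections import Counter

def oob_error(forest, bootstraps, X, y):
    """Each record is voted on only by the trees whose bootstrap did not contain it."""
    wrong, counted = 0, 0
    for i in range(len(X)):
        votes = []
        for (tree, feats), (_, _, idx) in zip(forest, bootstraps):
            if i not in idx:                              # record i is out-of-bag for this tree
                votes.append(tree.predict(X[i, feats].reshape(1, -1))[0])
        if votes:                                         # a few records may appear in every bootstrap
            counted += 1
            wrong += Counter(votes).most_common(1)[0][0] != y[i]
    return wrong / counted
```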
(Thanks @Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data (a bootstrap sample of size n contains on average about 1 - 1/e ≈ 63.2% of the distinct records, which is where the "about 2/3" comes from). As the forest is built, each tree can thus be tested (similar to leave-one-out cross-validation) on the samples that were not used in building that tree. This is the out-of-bag error estimate: an internal error estimate of a random forest as it is being constructed.
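In practice you rarely compute this by hand. For example, scikit-learn's RandomForestClassifier reports it when oob_score=True, so you can watch how the OOB error behaves as you grow more trees; the dataset and parameter values below are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for n_trees in (25, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    print(n_trees, 1 - rf.oob_score_)   # OOB error; it typically flattens out as trees are added
```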