I've read in this documentation that:
"Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value."
But it is still unclear to me how this works. If I set `sample_weight` with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.
So I spent a little time looking at the sklearn source because I've actually been meaning to try to figure this out myself for a little while now, too. I apologize for the length, but I don't know how to explain it more briefly.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of each class in that region. Normally, we estimate:

    Pr(Class=k) = #(samples of class k in region) / #(samples in region)

The impurity measure takes as input the array of class probabilities:

    [Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]

and spits out a number, which tells you how "impure", or how inhomogeneous-by-class, the region of feature space is. For example, the gini measure for a two-class problem is `2*p*(1-p)`, where `p = Pr(Class=1)` and `1-p = Pr(Class=2)`.
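As a tiny illustration, here is a throwaway helper (my own, not part of sklearn) evaluating that two-class gini formula:

```python
def gini(p):
    """Gini impurity for a two-class problem with Pr(Class=1) = p."""
    return 2 * p * (1 - p)

print(gini(0.5))      # 0.5, maximally impure: the classes are split 50/50
print(gini(2.0 / 3))  # 0.4444..., the value that shows up in the worked example
print(gini(1.0))      # 0.0, a pure node
```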
Now, basically, the short answer to your question is: `sample_weight` augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.

I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
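The code that originally accompanied this is missing from this copy; a minimal reconstruction consistent with the probabilities worked out below (the exact input values in `X` are my assumption; the class labels are chosen so that `Pr(Class=1) = 2/3` in the root, as used throughout this answer):

```python
from sklearn.tree import DecisionTreeClassifier

# Three 1-D training examples. Two are class 1 and one is class 2,
# so in the root node Pr(Class=1) = 2/3.
X = [[0.0], [1.0], [2.0]]
Y = [1, 2, 1]

# max_depth=1 gives a tree with just a root node and two children.
dtc = DecisionTreeClassifier(max_depth=1)
```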
So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.
**Case 1: no `sample_weight`**
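The snippet and its printed output are missing here; a self-contained sketch of what Case 1 likely looked like (the data values are my assumption, chosen to be consistent with the numbers discussed in this answer):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]              # two samples of class 1, one of class 2

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y)

print(dtc.tree_.threshold)  # root split at x = 0.5; the -2 entries mark leaves
print(dtc.tree_.impurity)   # gini per node: root, left child, right child
```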
The first value in the `threshold` array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in `threshold` are placeholders and are to be ignored. The `impurity` array gives the computed impurity values in the parent, left, and right nodes, respectively.

In the parent node, `p = Pr(Class=1) = 2./3.`, so that `gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444...`. You can confirm the child node impurities as well.

**Case 2: with `sample_weight`**

Now, let's try:
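The snippet that followed is missing from this copy; a self-contained sketch (the data values are my assumption, matching the probabilities computed in this answer):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y, sample_weight=[1, 2, 3])  # weight the samples 1x, 2x, 3x

print(dtc.tree_.threshold)  # the root split moves to x = 1.5
print(dtc.tree_.impurity)   # [0.4444..., 0.4444..., 0.0]
```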
You can see the feature threshold is different. `sample_weight` also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.

The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:

    p = Pr(Class=1) = (1 + 3) / (1 + 2 + 3) = 2/3

The gini measure of `4/9` follows.

Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that the impurity is calculated to be `4/9` in the left child node as well, because:

    p = Pr(Class=1) = 1 / (1 + 2) = 1/3

The impurity of zero in the right child is due to only one training example lying in that region.
You can extend this to non-integer sample weights similarly. I recommend trying something like `sample_weight = [1, 2, 2.5]` and confirming the computed impurities.

Hope this helps!
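If you want to check that exercise, here is a sketch (the data values are my assumption, following the same three-sample layout used in the worked example):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y, sample_weight=[1, 2, 2.5])

# Parent node: p = Pr(Class=1) = (1 + 2.5) / 5.5 = 7/11,
# so gini = 2 * (7/11) * (4/11) = 56/121 = 0.4628...
print(dtc.tree_.threshold)
print(dtc.tree_.impurity)
```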
If anyone is just looking for a way to compute `sample_weight`, as I was, you might find this handy: `sklearn.utils.class_weight.compute_sample_weight`.
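For example, with `class_weight='balanced'` the per-class sums of the returned weights come out equal, which is exactly the normalization the quoted documentation describes (the toy `y` below is my own):

```python
from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each sample by n_samples / (n_classes * count(class)),
# so samples from under-represented classes get proportionally larger weights.
y = [1, 1, 1, 2]
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)  # [0.6667, 0.6667, 0.6667, 2.0]: each class's weights sum to 2
```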