I've read in this documentation that:
"Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value."
But it is still unclear to me how this works. If I set `sample_weight` with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.
So I spent a little time looking at the sklearn source because I've actually been meaning to try to figure this out myself for a little while now, too. I apologize for the length, but I don't know how to explain it more briefly.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of each class in that region. Normally, we estimate:

    Pr(Class=k) = #(samples of class k in region) / #(samples in region)

The impurity measure takes as input the array of class probabilities:

    [Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]

and spits out a number, which tells you how "impure", or how inhomogeneous-by-class, the region of feature space is. For example, the gini measure for a two-class problem is `2*p*(1-p)`, where `p = Pr(Class=1)` and `1-p = Pr(Class=2)`.
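As a tiny illustration, here is a throwaway helper (my own, not part of sklearn) evaluating that two-class gini formula:

```python
def gini(p):
    """Gini impurity for a two-class problem with Pr(Class=1) = p."""
    return 2 * p * (1 - p)

print(gini(0.5))      # 0.5, maximally impure: the classes are split 50/50
print(gini(2.0 / 3))  # 0.4444..., the value that shows up in the worked example
print(gini(1.0))      # 0.0, a pure node
```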
Now, basically, the short answer to your question is: `sample_weight` augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.

I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
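The code that originally accompanied this is missing from this copy; a minimal reconstruction consistent with the probabilities worked out below (the exact input values in `X` are my assumption; the class labels are chosen so that `Pr(Class=1) = 2/3` in the root, as used throughout this answer):

```python
from sklearn.tree import DecisionTreeClassifier

# Three 1-D training examples. Two are class 1 and one is class 2,
# so in the root node Pr(Class=1) = 2/3.
X = [[0.0], [1.0], [2.0]]
Y = [1, 2, 1]

# max_depth=1 gives a tree with just a root node and two children.
dtc = DecisionTreeClassifier(max_depth=1)
```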
So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.
**Case 1: no `sample_weight`**
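The snippet and its printed output are missing here; a self-contained sketch of what Case 1 likely looked like (the data values are my assumption, chosen to be consistent with the numbers discussed in this answer):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]              # two samples of class 1, one of class 2

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y)

print(dtc.tree_.threshold)  # root split at x = 0.5; the -2 entries mark leaves
print(dtc.tree_.impurity)   # gini per node: root, left child, right child
```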
The first value in the `threshold` array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in `threshold` are placeholders and are to be ignored. The `impurity` array gives the computed impurity values in the parent, left, and right nodes, respectively.

In the parent node, `p = Pr(Class=1) = 2./3.`, so that `gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444...`. You can confirm the child node impurities as well.

**Case 2: with `sample_weight`**

Now, let's try:
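The snippet that followed is missing from this copy; a self-contained sketch (the data values are my assumption, matching the probabilities computed in this answer):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y, sample_weight=[1, 2, 3])  # weight the samples 1x, 2x, 3x

print(dtc.tree_.threshold)  # the root split moves to x = 1.5
print(dtc.tree_.impurity)   # [0.4444..., 0.4444..., 0.0]
```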
You can see the feature threshold is different. `sample_weight` also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.

The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:

    p = Pr(Class=1) = (1 + 3) / (1 + 2 + 3) = 2/3

The gini measure of `4/9` follows.

Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that the impurity is calculated to be `4/9` in the left child node as well, because:

    p = Pr(Class=1) = 1 / (1 + 2) = 1/3

The impurity of zero in the right child is due to only one training example lying in that region.
You can extend this to non-integer sample weights similarly. I recommend trying something like `sample_weight = [1, 2, 2.5]` and confirming the computed impurities.

Hope this helps!
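If you want to check that exercise, here is a sketch (the data values are my assumption, following the same three-sample layout used in the worked example):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [1.0], [2.0]]  # assumed 1-D inputs
Y = [1, 2, 1]

dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X, Y, sample_weight=[1, 2, 2.5])

# Parent node: p = Pr(Class=1) = (1 + 2.5) / 5.5 = 7/11,
# so gini = 2 * (7/11) * (4/11) = 56/121 = 0.4628...
print(dtc.tree_.threshold)
print(dtc.tree_.impurity)
```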
If anyone is just looking for a way to compute `sample_weight`, as I was, you might find this handy: `sklearn.utils.class_weight.compute_sample_weight`.
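For example, with `class_weight='balanced'` the per-class sums of the returned weights come out equal, which is exactly the normalization the quoted documentation describes (the toy `y` below is my own):

```python
from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each sample by n_samples / (n_classes * count(class)),
# so samples from under-represented classes get proportionally larger weights.
y = [1, 1, 1, 2]
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)  # [0.6667, 0.6667, 0.6667, 2.0]: each class's weights sum to 2
```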