Building a Decision Tree

Posted 2019-07-25 17:02

Question:

When building a decision tree, at each node we select the best feature, and then the best splitting position for that feature. However, what do I do when the best feature's value is 0 for every sample in the current node's set? All samples keep being grouped to one side (the <= 0 branch), and an infinite loop occurs. For example:

#left: 1500, #right: 0

then,

#left: 1500, #right: 0

and so on...

Just for reference, I'm following this pseudo-code:

GrowTree(S)
  if (y_i = C for all i in S and some class C) then {
    return new leaf(C)
  } else {
    choose best splitting feature j and splitting point beta  (*)
    // I choose the (j, beta) that gives the maximum entropy drop
    S_l = {i : X_ij < beta}
    S_r = {i : X_ij >= beta}
    return new node(j, beta, GrowTree(S_l), GrowTree(S_r))
  }
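A minimal Python sketch of this pseudo-code, under one added assumption: candidate thresholds are taken as midpoints between distinct sorted feature values, so a chosen split can never send all samples to one side. A constant feature (e.g. all zeros) simply contributes no candidate thresholds. The function and variable names here are my own, not from the question.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature j, threshold beta, info gain) of the best split,
    or None if no threshold separates the samples (all features constant)."""
    best = None
    parent = entropy(y)
    for j in range(len(X[0])):
        # Midpoints between distinct sorted values: both children are
        # guaranteed non-empty, so the degenerate 1500/0 split cannot occur.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            beta = (lo + hi) / 2
            left = [y[i] for i, row in enumerate(X) if row[j] < beta]
            right = [y[i] for i, row in enumerate(X) if row[j] >= beta]
            gain = parent - (len(left) * entropy(left)
                             + len(right) * entropy(right)) / len(y)
            if best is None or gain > best[2]:
                best = (j, beta, gain)
    return best

def grow_tree(X, y):
    """Recursive builder following the pseudo-code above."""
    if len(set(y)) == 1:            # all samples share one class -> leaf
        return ('leaf', y[0])
    split = best_split(X, y)
    if split is None:               # no split possible -> majority-class leaf
        return ('leaf', Counter(y).most_common(1)[0][0])
    j, beta, _ = split
    left = [(row, lbl) for row, lbl in zip(X, y) if row[j] < beta]
    right = [(row, lbl) for row, lbl in zip(X, y) if row[j] >= beta]
    return ('node', j, beta,
            grow_tree([r for r, _ in left], [l for _, l in left]),
            grow_tree([r for r, _ in right], [l for _, l in right]))
```

For example, on data where feature 0 is constantly 0 but feature 1 separates the classes, the tree splits on feature 1 and terminates normally.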

Answer 1:

This is simply impossible if the split is chosen correctly. You are supposed to select the threshold that leads to the biggest increase in model certainty (the largest entropy drop). A threshold that puts every single instance in the same branch gives a 0 increase in the model's certainty, thus it is never the best split. The situation where every candidate split has zero gain can happen only when the impurity/entropy of the node is already 0, and that is exactly the stopping criterion for creating a leaf in a decision tree.
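The zero-gain claim is easy to verify numerically. The sketch below (my own illustration, using the 1500-sample figure from the question) computes the information gain of the degenerate split that sends all samples left:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

y = [0] * 750 + [1] * 750        # 1500 samples, two classes mixed 50/50

# The degenerate "split" from the question: all 1500 samples go left.
left, right = y, []
gain = entropy(y) - (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(y)
print(gain)   # 0.0 -- no certainty gained, so it cannot be the best split
```

Since the gain is exactly 0, any split that actually separates the samples (gain > 0) is preferred, and if no such split exists the node should become a leaf rather than recurse.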