Why the decision tree structure is only binary tre

2020-03-24 07:28发布

问题:

As we can see from the sklearn document here, or from my experiment, all the tree structure of DecisionTreeClassifier is binary tree. Either the criterion is gini or entropy, each DecisionTreeClassifier node can only has 0 or 1 or 2 child node.

But from the decision tree introduction slide(page 3), each node of theoretic decision tree can has more than 2 child node.

So my question is why the decision tree structure is only binary tree (each DecisionTreeClassifier node can only has 1 or 2 child node.) for sklearn DecisionTreeClassifier? Can we get the tree structure with more than 2 child node for DecisionTreeClassifier?

回答1:

It is because sklearn's approach is to work with numerical features, not categorical, when you have numerical feature, it is relatively hard to build a nice splitting rule which can have arbitrary number of thresholds (which is required to produce more than 2 children). For categorical features, on the other hand (used in the slides provided), another possible option is to have as many children as possible values. Both approach have its own problems (categorical approach makes it nearly intracktable when you have plenty of possible values) and numerical requires particular features encoding (one hot for categorical, which efficiently means that you can still express the same tree, but instead of having "species" with 3 children [dog, cat, human] you will have deeper tree with decisions: [dog, not dog], [not dog but cat, not dog, not cat but human]).

So the short answer is no, you cannot achieve more than 2 children with this implementation, however this is not something truly limiting in general.