CART algorithm of matlab 'fitctree' takes

Posted 2019-02-26 00:57

Here is an example showing that MATLAB's fitctree takes the order of the features into account. Why?

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y)
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y)
view(Mdl1,'mode','graph');

The two models are not the same, so the classification accuracy may differ even though both are trained on the same features. Why?

1 answer
在下西门庆
#2 · 2019-02-26 01:42

In your example, X contains 34 predictors. The predictors have no names, so fitctree just refers to them by their column numbers: x1, x2, ..., x34. If you flip the matrix, the column numbers change and therefore so do the names: x1 -> x34, x2 -> x33, and so on.
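The renumbering can be checked directly. The following is a minimal sketch that uses a small stand-in matrix with 34 columns (not the ionosphere data) to confirm that column k of the original matrix ends up at column 35 - k after fliplr:

```matlab
% Stand-in matrix with 34 columns (assumed toy data, not the ionosphere set).
p  = 34;
X  = reshape(1:5*p, 5, p);
X1 = fliplr(X);

oldIdx = 1:p;              % column numbers before flipping
newIdx = p + 1 - oldIdx;   % column numbers after fliplr

% Original column 5 is column 30 after flipping; original 29 becomes 6.
moved5  = isequal(X(:, 5),  X1(:, newIdx(5)));
moved29 = isequal(X(:, 29), X1(:, newIdx(29)));
fprintf('x5 -> x%d, x29 -> x%d\n', newIdx(5), newIdx(29));
```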

For most nodes this does not matter, because CART always splits a node on the predictor that maximises the impurity gain from the parent node to its two child nodes. But sometimes several predictors result in the same impurity gain. In that case fitctree appears to pick the one with the lowest column number. Since the column numbers changed when the predictors were reordered, you end up with a different predictor at that node.
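To make the tie concrete, here is a small sketch with hypothetical data (the vectors y, a and b are made up for illustration). It computes the Gini impurity gain of one candidate cut on each of two predictors. The two cuts send different observations to each side, yet the class counts in the children are the same, so the gains are equal:

```matlab
% Hypothetical binary labels and two predictors (illustration only).
y = [1 1 1 0 0 0]';
a = [0.9 0.8 0.7 0.2 0.1 0.75]';
b = [0.9 0.8 0.7 0.2 0.75 0.1]';

% Gini impurity for 0/1 labels, and the gain of a split given as a logical mask.
gini = @(labels) 1 - mean(labels)^2 - (1 - mean(labels))^2;
gain = @(mask) gini(y) - mean(mask) * gini(y(mask)) - mean(~mask) * gini(y(~mask));

maskA = a >= 0.5;   % sends observations 1, 2, 3, 6 to the right child
maskB = b >= 0.5;   % sends observations 1, 2, 3, 5 to the right child

gainA = gain(maskA);
gainB = gain(maskB);
fprintf('gain(a) = %.4f, gain(b) = %.4f\n', gainA, gainB);
```

Both gains come out as 0.25 here, so the impurity criterion alone cannot decide between the two predictors and some other rule has to break the tie.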

E.g. let's look at the marked split:

[Figure: tree for the original order (Mdl)]
[Figure: tree for the flipped order (Mdl1)]

Up to this point, both trees chose the same predictors and cut values; only the names differ because of the reordering (for example, x5 in the original data is x30 in the flipped model). At the marked split, however, x3 (original order) and x6 (flipped order) are genuinely different predictors: x6 in the flipped order is x29 in the original order.

A scatter plot between those predictors shows how this could happen:

[Figure: scatter plot of predictor 3 against predictor 29]

The blue and cyan lines mark the splits made by Mdl and Mdl1, respectively, at that node. As we can see, both splits yield child nodes with the same number of elements per label. CART can therefore choose either of the two predictors; both give the same impurity gain.

In that case it seems to just pick the one with the lower column number. In the non-flipped matrix, x3 is chosen instead of x29 because 3 < 29. But if you flip the matrix, x3 becomes x32 and x29 becomes x6. Since 6 < 32, you now end up with x6, which is the original x29.
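One way to probe this tie-breaking behaviour yourself is with a toy data set in which two predictors separate the classes equally well. This is only a sketch: the data are made up, and it prints whatever fitctree chooses rather than assuming a result. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data: both predictors separate the two classes perfectly,
% so their best cuts have the same impurity gain at the root.
a = (1:20)';
b = 10 * a + 3;              % different values, same ordering as a
y = double(a > 10);

MdlAB = fitctree([a b], y);  % a is column 1, b is column 2
MdlBA = fitctree([b a], y);  % columns swapped

% CutPredictor{1} is the name of the predictor used at the root node.
rootAB = MdlAB.CutPredictor{1};
rootBA = MdlBA.CutPredictor{1};
fprintf('root split, order [a b]: %s\n', rootAB);
fprintf('root split, order [b a]: %s\n', rootBA);
```

If the same column name is reported for both orderings, the choice follows the column position rather than the underlying variable, which is consistent with the behaviour described above.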

Ultimately this does not matter: the decision tree for the flipped matrix is neither better nor worse. In this example it only happens in the lower nodes, where the tree starts to overfit, so you really don't have to worry about it.
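If you want to check this on your own data, a rough sketch is to compare the resubstitution loss of the two trees and how often their training-set predictions agree. The variable names lossOrig, lossFlip and agree are my own; no particular values are claimed here.

```matlab
load ionosphere              % Contains X and Y variables
Mdl  = fitctree(X, Y);
X1   = fliplr(X);
Mdl1 = fitctree(X1, Y);

% Training-set misclassification rate of each tree.
lossOrig = resubLoss(Mdl);
lossFlip = resubLoss(Mdl1);

% Fraction of training observations on which the two trees agree.
% Each model must be given the data in its own column order.
agree = mean(strcmp(predict(Mdl, X), predict(Mdl1, X1)));

fprintf('resubstitution loss: %.4f (original), %.4f (flipped)\n', lossOrig, lossFlip);
fprintf('fraction of identical predictions: %.4f\n', agree);
```

Resubstitution loss only describes the fit to the training data. For an estimate of generalisation you could compare cross-validated losses instead, for example kfoldLoss(crossval(Mdl)) against kfoldLoss(crossval(Mdl1)), keeping in mind that those values also vary with the random fold assignment.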

Appendix:

Code for scatter plot generation:

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y);
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y);
view(Mdl1,'mode','graph');

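% Select the observations that reach the marked node by following the
% splits above it in the original tree.
idx = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5);
remainder = X(idx,:);
labels = cell2mat(Y(idx,:));   % single-character labels 'b' / 'g'

% Predictor 3 against predictor 29 (35-6 = 29, i.e. x6 in the flipped order).
gscatter(remainder(:,3), remainder(:,(35-6)), labels,'rgb','osd');

limits = [-1.5 1.5];
xlim(limits)
ylim(limits)
xlabel('predictor 3')
ylabel('predictor 29')
hold on
plot([0.73 0.73], limits, '-b')     % split used by Mdl (on predictor 3)
plot(limits, [0.693 0.693], '-c')   % split used by Mdl1 (on predictor 29)
legend({'b', 'g'})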