Here is an example showing that MATLAB's fitctree takes the order of the features into account. Why?
load ionosphere            % contains the predictor matrix X and the labels Y
Mdl = fitctree(X,Y)        % tree on the original column order
view(Mdl,'mode','graph');
X1 = fliplr(X);            % reverse the column order of the predictors
Mdl1 = fitctree(X1,Y)      % tree on the flipped column order
view(Mdl1,'mode','graph');
The result is not the same model, and thus not the same classification accuracy, despite using the same features. Why?
In your example, X contains 34 predictors. The predictors have no names, so fitctree just refers to them by their column numbers: x1, x2, ..., x34. If you flip the table, the column numbers change and therefore so do the names: x1 -> x34, x2 -> x33, etc.

For most nodes this does not matter, because CART always splits a node on the predictor that maximises the impurity gain between the two child nodes. But sometimes there are multiple predictors that result in the same impurity gain. In that case it just picks the one with the lowest column number. And since the column numbers changed when you reordered the predictors, you can end up with a different predictor at that node.
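To see the renaming concretely, here is a minimal check using the X and X1 from the question: with 34 columns, flipped column k holds original column 35-k, so the same physical predictor carries a different default name in the two models.

N = size(X,2);                         % 34 predictors in the ionosphere data
k = 6;                                 % e.g. x6 in the flipped model ...
origCol = N + 1 - k                    % ... is column 29, i.e. x29 in the original X
isequal(X1(:,k), X(:,origCol))         % true: same physical predictor, different name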
For example, let's look at the marked split:

Original order (Mdl): [tree view screenshot]

Flipped order (Mdl1): [tree view screenshot]

Up to this point the same predictors and cut values have been chosen in both trees; only the names changed because of the reordering, e.g. x5 in the old data = x30 in the new model. But x3 and x6 are actually different predictors: x6 in the flipped order is x29 in the original order. A scatter plot of those two predictors shows how this could happen:

[scatter plot of x3 vs x29, with the two competing splits marked]
In that plot, the blue and cyan lines mark the splits performed by Mdl and Mdl1, respectively, at that node. As you can see, both splits yield child nodes with the same number of elements per label. Therefore CART can choose either of the two predictors; both cause the same impurity gain.

In that case it seems to just pick the one with the lower column number. In the non-flipped table, x3 is chosen instead of x29 because 3 < 29. But if you flip the table, x3 becomes x32 and x29 becomes x6. Since 6 < 32, you now end up with x6, i.e. the original x29.

Ultimately this does not matter: the decision tree of the flipped table is neither better nor worse. The effect only appears in the lower nodes, where the tree starts to overfit, so you really don't have to care about it.
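As a minimal sketch of that tie-breaking (assuming fitctree resolves exact ties by taking the predictor with the lowest column number, as described above), consider a toy dataset with two predictors that allow exactly the same perfect split. Which predictor name shows up at the root then depends only on the column order:

a = [1 2 3 4 5 6]';
b = [9 8 7 3 2 1]';                    % different values, but the same best split
Xtoy = [a b];
Ytoy = {'g';'g';'g';'b';'b';'b'};
T  = fitctree(Xtoy, Ytoy);
T.CutPredictor(1)                      % 'x1': the lower column number wins the tie
Tf = fitctree(fliplr(Xtoy), Ytoy);
Tf.CutPredictor(1)                     % 'x1' again, but that column now holds the other predictor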
Appendix:
Code for scatter plot generation:
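A minimal sketch (the two cut values below are placeholders; read the actual values off the tree views of Mdl and Mdl1):

load ionosphere                          % X and Y as in the question
gscatter(X(:,3), X(:,29), Y);            % x3 vs x29 (= x6 in the flipped order)
xlabel('x3'); ylabel('x29');
hold on
xline(0.5, 'b');                         % cut value chosen by Mdl  (placeholder)
yline(0.5, 'c');                         % cut value chosen by Mdl1 (placeholder)
hold off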