How does the C4.5 Algorithm handle continuous data

Posted 2020-06-23 08:52

I am implementing the C4.5 algorithm in .NET, but I don't have a clear idea of how it deals with continuous (numeric) data. Could someone give me a more detailed explanation?

Tags: .net algorithm decision-tree continuous c4.5

1 answer

欢心

Post #2 · 2020-06-23 09:28

For continuous data, C4.5 uses a threshold value: every sample whose value is below the threshold goes to the left node, and every sample whose value is above it goes to the right node. The question is how to derive that threshold from the data you're given. The trick is to sort your data by the continuous variable in ascending order, then iterate over the sorted data, picking a candidate threshold between each pair of adjacent values. For example, if your data for attribute x is:

0.5, 1.2, 3.4, 5.4, 6.0

You first pick a threshold between 0.5 and 1.2. In this case we can just use the average: 0.85. Now compute the reduction in impurity (entropy) that this split gives:

Gain(x < 0.85) = H(S) - l/N * H(x < 0.85) - r/N * H(x > 0.85)

Here H(S) is the entropy of the node being split, H(x < 0.85) and H(x > 0.85) are the entropies of the left and right child nodes, l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In our example, splitting at 0.85 gives l=1, r=4, and N=5.
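The calculation above can be sketched in a few lines of Python. This is only an illustration: the answer gives no class labels for the five values, so the `labels` list below is made up, and the function names `candidate_thresholds`, `entropy` and `information_gain` are hypothetical helpers, not part of any C4.5 library.

```python
import math
from collections import Counter


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """H(S) - l/N * H(left) - r/N * H(right) for the split x < threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


xs = [0.5, 1.2, 3.4, 5.4, 6.0]          # attribute values from the example
labels = ["a", "a", "b", "b", "b"]      # ASSUMED labels, for illustration only

# Candidates are the midpoints 0.85, 2.3, 4.4, 5.7 (up to floating-point rounding).
for t in candidate_thresholds(xs):
    print(t, information_gain(xs, labels, t))
```

With these assumed labels, the split at 0.85 has l=1 and r=4 as in the text, and the split at 2.3 separates the two classes completely, so it gets the largest gain.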

Remember the computed impurity reduction, and now compute it for the next candidate split, between 1.2 and 3.4 (i.e. x > 2.3). Repeat that for every candidate split (n-1 splits for n distinct values). Then pick the split that reduces H the most, that is, the one with the largest reduction. The resulting nodes should be purer than the node you started with; if no split increases the purity, don't split. You can also enforce a minimum node size so you don't end up with left or right nodes containing only one sample.
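The whole search might look like the following sketch. It follows the procedure described in this answer (plain entropy reduction, midpoint thresholds, an optional minimum node size); the real C4.5 ranks splits by gain ratio and adds further refinements that are left out here. The function name `best_split`, the `min_leaf` parameter and the sample labels are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels, min_leaf=1):
    """Return (threshold, gain) of the best split, or (None, 0.0) if none helps."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort by attribute value
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    parent = entropy(ys)

    best_threshold, best_gain = None, 0.0
    for i in range(1, n):                 # i = number of samples in the left node
        if xs[i] == xs[i - 1]:
            continue                      # no threshold fits between equal values
        if i < min_leaf or n - i < min_leaf:
            continue                      # a child node would be too small
        gain = (parent
                - i / n * entropy(ys[:i])
                - (n - i) / n * entropy(ys[i:]))
        if gain > best_gain:              # keep only splits that improve purity
            best_threshold = (xs[i - 1] + xs[i]) / 2
            best_gain = gain
    return best_threshold, best_gain


xs = [0.5, 1.2, 3.4, 5.4, 6.0]          # attribute values from the example
labels = ["a", "a", "b", "b", "b"]      # ASSUMED labels, for illustration only
print(best_split(xs, labels))
print(best_split(xs, labels, min_leaf=2))
```

Returning `None` as the threshold signals that the node should become a leaf: either every candidate was ruled out by `min_leaf`, or no candidate made the children purer than the parent.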
