What algorithm is used in spark decision tree (is

Posted 2020-04-03 03:59

I have a question about decision trees in MLlib. Which algorithm does Spark use: ID3, C4.5, or CART?

2 answers
别忘想泡老子
Reply #2 · 2020-04-03 04:36

If you follow the Apache Spark link and look at the section

Node impurity and information gain (Basic Algorithm)

you will find:

The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance)

Also, the Decision Tree link says that the CART (classification and regression tree) algorithm uses Gini impurity and entropy for classification and variance reduction for regression. Those are the same measures Spark provides, which points to CART.
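
To make the three impurity measures concrete, here is a minimal pure-Python sketch of how each one can be computed for the samples at a single node. This is not Spark's code: the function names are made up for illustration, and the base-2 logarithm for entropy is an assumption of this sketch.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum of f * log2(f) over the class frequencies f."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def variance(values):
    """Population variance of the regression targets at the node."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


if __name__ == "__main__":
    print(gini([0, 0, 1, 1]))              # evenly mixed two-class node
    print(entropy([0, 0, 1, 1]))           # evenly mixed two-class node
    print(variance([1.0, 2.0, 3.0, 4.0]))  # regression targets
```

A pure node (all labels equal) scores 0 under both classification measures, and an evenly mixed node scores highest, which is why a tree prefers splits that lower these values.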

Emotional °昔
Reply #3 · 2020-04-03 04:46

Spark MLlib uses the ID3 algorithm with CART.

ID3 handles only categorical variables, whereas CART can also handle continuous variables. Spark decision trees can handle continuous variables as well as categorical ones, so Spark is using CART (the Jira ticket mentioned below shows that C4.5 has not been implemented yet).
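
To show what handling a continuous variable means in practice, here is a rough sketch of a CART-style binary split search on one numeric feature, scored by the drop in Gini impurity. It is an illustration only, not Spark's implementation, and `best_threshold` is a hypothetical helper name. It tries every midpoint between adjacent distinct values, whereas Spark limits the candidate splits by binning continuous features (see its `maxBins` parameter).

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (threshold, gain) for the best split of the form x <= threshold.

    gain = parent impurity - weighted impurity of the two children.
    Returns (None, 0.0) when no split reduces the impurity.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    parent = gini(ys)
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (parent
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    ys = [0, 0, 0, 1, 1, 1]
    print(best_threshold(xs, ys))
```

An algorithm restricted to categorical features would instead branch on each category value, so it has no equivalent of the threshold search above.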

This blog post has some information about the different algorithms; it is where I got the answer from.

You can find a discussion on extending it to C4.5 in this Jira ticket.

There is more information about the differences between the algorithms here.
