I have a question about decision trees in MLlib. What algorithm is used in Spark? Is it ID3, C4.5 or CART?
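If you look at the linked Apache Spark documentation, the section on node impurity and information gain lists the impurity measures Spark supports: Gini impurity and entropy for classification, and variance for regression.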
Also, if you look at the linked Decision Tree article, you can see that the CART (classification and regression tree) algorithm uses Gini impurity and entropy for classification and variance reduction for regression.
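To make the three impurity measures concrete, here is a minimal pure-Python sketch of how each one is usually defined. The function names are illustrative and are not taken from Spark; this is the textbook definition, not Spark's internal code.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class frequencies p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def variance(values):
    """Population variance, the usual impurity for regression targets."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n
```

A node whose labels all belong to one class has Gini impurity and entropy of zero; a balanced two-class node is the worst case for both.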
Spark MLlib uses the ID3 algorithm with CART.
ID3 only handles categorical variables, while CART can also handle continuous variables. Spark decision trees handle continuous variables as well as categorical ones, so it is using CART (the Jira ticket mentioned below shows that C4.5 has not been implemented yet).
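The way CART copes with a continuous variable is a binary split of the form `x <= threshold`. The sketch below illustrates that idea in plain Python by trying the midpoints between sorted values and keeping the one with the lowest weighted Gini impurity. The name `best_threshold` is hypothetical, and this is a textbook-style illustration rather than Spark's implementation, which bins continuous features instead of testing every midpoint.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class frequencies p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) for the best split x <= threshold.

    Returns (None, inf) when all feature values are identical.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal feature values
        threshold = (lo + hi) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each child's impurity by its share of the samples.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

For example, `best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])` picks the threshold `2.5`, which separates the two classes perfectly.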
This blog post has some information about the different algorithms, and it is where I got the answer from.
You can find a discussion on extending it to C4.5 in this Jira ticket.
More information about the differences between the algorithms is available here.