I am doing research on data mining and more precisely, decision trees.
I would like to know if there are multiple algorithms to build a decision trees (or just one?), and which is better, based on criteria such as
- Performance
- Complexity
- Errors in decision making
- and more.
Decision Tree implementations differ primarily along these axes:
the splitting criterion (i.e., how "variance" is calculated)
whether it builds models for regression (continuous variables, e.g., a
score) as well as classification (discrete variables, e.g., a class
label)
technique to eliminate/reduce over-fitting
whether it can handle incomplete data
The major Decision Tree implementations are:
ID3, or Iterative Dichotomizer, was the first of three Decision Tree
implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)
CART, or Classification And Regression Trees is often used as a generic
acronym for the term Decision Tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing rule sets.
C4.5, Quinlan's next iteration. The new features (versus ID3) are:
(i) accepts both continuous and discrete features; (ii) handles
incomplete data points; (iii) solves over-fitting problem by (very
clever) bottom-up technique usually known as "pruning"; and (iv)
different weights can be applied the features that comprise the
training data. Of these, the first three are very important--and I would suggest that any DT implementation you choose has all three. The fourth (differential weighting) is much less important
C5.0, the most recent Quinlan iteration. This implementation is
covered by a patent and probably, as a result, is rarely implemented
(outside of commercial software packages). I have never coded a C5.0
implementation myself (I have never even seen the source code) so I can't offer an informed comparison of C5.0 versus C4.5. I have always
been skeptical about the improvements claimed by its inventor (Ross
Quinlan)--for instance, he claims it is "several orders of magnitude"
faster than C4.5. Other claims are similarly broad ("significantly more memory efficient") and so forth. I'll just point you to studies
which report the result of comparison of the two techniques and you can decide for yourself.
CHAID (chi-square automatic interaction detector) actually predates
the original ID3 implementation by about six years (published in a
Ph.D. thesis by Gordon Kass in 1980). I know every little about this technique.The R Platform has a Package called CHAID which
includes excellent documentation
MARS (multi-adaptive regression splines) is actually a term trademarked by the original inventor of MARS, Salford Systems. As a
result, MARS clones in libraries not sold by Salford are named something other than MARS--e.g., in R, the relevant function is polymars in the poly-spline library. Matlab and Statistica also have
implementations with MARS-functionality
I would recommend CART or C4.5 (though again, I have no direct experience with C5.0 or with CHAID, though I am familiar with their feature sets).
C4.5 is the Decision Tree flavor implemented in Orange; CART is the flavor in sklearn--both excellent implementations in excellent ML libraries.
C4.5 is a major step beyond ID3--both in terms of range (C4.5 has a far broader use case spectrum because it can handle continuous variables in the training data) and in terms of model quality.
Perhaps the most significant claimed improvement of C5.0 versus C4.5 is support for boosted trees. Ensemble support for DTs--boosted trees and Random Forests--has been included in the DT implementation in Orange; here, ensemble support was added to a C4.5 algorithm. sklearn also features a range of random forest and boosting methods.