I've been looking for a C++ implementation of the C4.5 algorithm, but I haven't been able to find one yet. I found Quinlan's C4.5 Release 8, but it's written in C. Has anybody seen any open source C++ implementations of the C4.5 algorithm?
I'm thinking about porting the J48 source code (or simply writing a wrapper around the C version) if I can't find an open source C++ implementation out there, but I hope I don't have to do that! Please let me know if you have come across a C++ implementation of the algorithm.
Update
I've been considering writing a thin C++ wrapper around the C implementation of the C5.0 algorithm (C5.0 is the improved version of C4.5). I downloaded and compiled the C implementation, but it doesn't look easy to port to C++. It uses a lot of global variables, so a thin C++ wrapper around the C functions will not give an object-oriented design: every class instance will be modifying the same global parameters. In other words, I will have no encapsulation, and that's a pretty basic thing that I need.
In order to get encapsulation I will need to make a full-blown port of the C code to C++, which is about as much work as porting the Java version (J48) to C++.
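To make the encapsulation problem concrete, here is a minimal sketch. The names `g_min_cases`, `GlobalBackedTree`, and `EncapsulatedTree` are made up for illustration and are not taken from the C5.0 sources; the global simply stands in for the kind of file-scope parameter the C code relies on.

```cpp
#include <cassert>

// Stand-in for one of the globals in the C code base (hypothetical name).
static int g_min_cases = 2;

// Thin wrapper: every instance reads and writes the same global,
// so configuring one object silently reconfigures all of them.
class GlobalBackedTree {
public:
    void set_min_cases(int n) { assert(n > 0); g_min_cases = n; }
    int min_cases() const { return g_min_cases; }
};

// Ported class: the parameter is a data member, so instances are independent.
class EncapsulatedTree {
public:
    void set_min_cases(int n) { assert(n > 0); min_cases_ = n; }
    int min_cases() const { return min_cases_; }
private:
    int min_cases_ = 2;
};
```

With two `GlobalBackedTree` objects, calling `set_min_cases(5)` on one changes what the other reports; with two `EncapsulatedTree` objects it does not. Getting the second behaviour for every parameter is what forces the full port.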
Update 2.0
Here are some specific requirements:
- Each classifier instance must encapsulate its own data (i.e. no global variables aside from constant ones).
- Support the concurrent training of classifiers and the concurrent evaluation of the classifiers.
Here is a good scenario: suppose I'm doing 10-fold cross-validation. I would like to concurrently train 10 decision trees, each on its respective slice of the training set. If I just run the C program for each slice, I would have to run 10 processes, which is not horrible. However, if I need to classify thousands of data samples in real time, then I would have to start a new process for each sample I want to classify, and that's not very efficient.
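The usage pattern I have in mind looks roughly like the sketch below. `MajorityClassifier` is only a placeholder for a real decision tree (it predicts the most frequent label), and `Sample` and `train_folds` are names I made up for this example. The point is just that each instance owns its state, so one classifier per fold can be trained on its own thread with `std::async`. Some toolchains need `-pthread` for this to link and run.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

struct Sample {
    double feature;
    int label;  // 0 or 1
};

// Placeholder for a decision tree: predicts the majority label it was trained on.
class MajorityClassifier {
public:
    void train(const std::vector<Sample>& data) {
        assert(!data.empty());
        std::size_t ones = 0;
        for (const Sample& s : data) ones += (s.label == 1);
        majority_ = (2 * ones >= data.size()) ? 1 : 0;
    }
    int classify(const Sample&) const { return majority_; }
private:
    int majority_ = 0;  // per-instance state, no globals
};

// Trains k classifiers concurrently. Fold i is trained on every sample
// whose index modulo k differs from i (the rest is the held-out slice).
std::vector<MajorityClassifier> train_folds(const std::vector<Sample>& data,
                                            std::size_t k) {
    std::vector<std::future<MajorityClassifier>> jobs;
    for (std::size_t fold = 0; fold < k; ++fold) {
        jobs.push_back(std::async(std::launch::async, [&data, fold, k] {
            std::vector<Sample> training;
            for (std::size_t i = 0; i < data.size(); ++i)
                if (i % k != fold) training.push_back(data[i]);
            MajorityClassifier model;
            model.train(training);
            return model;
        }));
    }
    std::vector<MajorityClassifier> models;
    for (auto& job : jobs) models.push_back(job.get());  // wait for each fold
    return models;
}
```

Calling `train_folds(data, 10)` returns ten independent models. Since `classify` is `const` and touches only member data, the same models can then be queried from several threads without starting a process per sample.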
I may have found a possible C++ "implementation" of C5.0 (See5.0), but I haven't been able to dig into the source code enough to determine if it really works as advertised.
To reiterate my original concerns, the author of the port states the following about the C5.0 algorithm:
I will update my answer as soon as I get some time to look into the source code.
Update
It's looking pretty good, here is the C++ interface:
I would say that this is the best alternative I've found so far.
If I'm reading this correctly, it appears to be organized not as a C API but as a C program. A data set is fed in, then it runs an algorithm and gives you back some rule descriptions.
I'd think the path you should take depends on whether you:
1. merely want a C++ interface for supplying data and retrieving rules from the existing engine, or
2. want a C++ implementation that you can tinker with in order to tweak the algorithm to your own ends
If what you want is (1) then you could really just spawn the program as a process, feed it input as strings, and take the output as strings. That would probably be the easiest and most future-proof way of developing a "wrapper", and then you'd only have to develop C++ classes to represent the inputs and model the rule results (or match existing classes to these abstractions).
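A minimal sketch of the process-spawning approach, assuming a POSIX system where `popen` and `pclose` are available. `run_and_capture` is a hypothetical helper name, and the command you pass would be whatever command line the C program expects. This version only reads the program's output; input would go through files or command-line arguments.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>
#include <string>
#include <stdio.h>  // popen, pclose (POSIX)

// Runs a shell command and returns everything it wrote to stdout.
// Throws if the process cannot be started or exits with a non-zero status.
std::string run_and_capture(const std::string& command) {
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) throw std::runtime_error("popen failed");

    std::string output;
    std::array<char, 256> buffer;
    while (fgets(buffer.data(), static_cast<int>(buffer.size()), pipe) != nullptr)
        output += buffer.data();

    if (pclose(pipe) != 0) throw std::runtime_error("command failed");
    return output;
}
```

The C++ classes for the inputs and the rule results would then sit on either side of this call: one serializes the data set to the format the program reads, the other parses the returned string.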
But if what you want is (2), then I'd suggest trying whatever hacks you have in mind on top of the existing code in either C or Java, whichever you are most comfortable with. You'll get to know the code that way, and if you have any improvements you may be able to feed them upstream to the author. If you build a relationship over the longer term then maybe you could collaborate and bring the C codebase slowly forward to C++, one aspect at a time, as the language was designed for.
Guess I just think the "When in Rome" philosophy usually works better than Port-In-One-Go, especially at the outset.
RESPONSE TO UPDATE: Process isolation takes care of your global variable issue. As for performance and data set size, you only have as many cores/CPUs and memory as you have. Whether you're using processes or threads usually isn't the issue when you're talking about matters of scale at that level. The overhead you would run into is the marshalling of data between processes, if that turns out to be too expensive.
Prove the marshalling is the bottleneck, and to what extent... and you can build a case for why a process is a problem over a thread. But, there may be small tweaks to existing code to make marshalling cheaper which don't require a rewrite.
A C++ implementation for C4.5 called YaDT is available here, in the "Decision Trees" section:
http://www.di.unipi.it/~ruggieri/software.html
This is the source code for the last version:
http://www.di.unipi.it/~ruggieri/YaDT/YaDT1.2.5.zip
From the paper where the tool is described:
The paper is available here.