Weka J48 Classifier: Cannot handle numeric class?

Posted 2019-02-16 02:06

Question:

I'm now trying to build a J48 (C4.5) classifier model on my training data using Weka.

First I do this, which seems to go OK:

java -Xmx10G -cp /weka/weka.jar weka.core.converters.TextDirectoryLoader -dir /home/test/cats > /home/test/cats.arff

This seems to go OK too:

java -Xmx10G -cp /weka/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -i /home/test/cats.arff -o /home/test/cats-vector.arff

This does not go OK:

java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats-vector.arff -d /home/test/cats.model

It gives the following error:

weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45PruneableClassifierTree: Cannot handle numeric class!
        at weka.core.Capabilities.test(Capabilities.java:954)
        at weka.core.Capabilities.test(Capabilities.java:1110)
        at weka.core.Capabilities.test(Capabilities.java:1023)
        at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
        at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier(C45PruneableClassifierTree.java:116)
        at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
        at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
        at weka.classifiers.trees.J48.main(J48.java:948)

So I then tried this:

java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats.arff -d /home/test/cats.model

Which also gives the error:

weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45PruneableClassifierTree: Cannot handle string attributes!
        at weka.core.Capabilities.test(Capabilities.java:980)
        at weka.core.Capabilities.test(Capabilities.java:869)
        at weka.core.Capabilities.test(Capabilities.java:1085)
        at weka.core.Capabilities.test(Capabilities.java:1023)
        at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
        at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier(C45PruneableClassifierTree.java:116)
        at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
        at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
        at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
        at weka.classifiers.trees.J48.main(J48.java:948)

Obviously I've prepared the data wrong somehow (BTW the input is text files in subdirectories named after the categories I want). But I thought I was following the instructions on the Weka Wiki: the "Categorizing Text Files" page and the Weka Wiki Primer.

So what am I doing wrong? I would like to use J48 because it has given high accuracy on my data in tests. What do I need to do to my data so that the J48 classifier will accept it? Or do I need to use a different classifier?

Please help!

Answer 1:

J48 is a tree classifier that only accepts nominal classes, meaning the classes into which you will sort your instances must be known beforehand. For example, if you are trying to predict a rating and you know that the rating is on a 5-level Likert scale, you have to say so explicitly in your ARFF file with something like @attribute class {1,2,3,4,5}. But if you want to predict the weight of a person, that value is probably a real number and therefore cannot 'fit' in a tree classification. NB: one way to work around that would be to bin the available weights: from 10 to 15 kg, from 15 to 20 kg, etc. This way you get a nominal class attribute.
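For example, the 5-level rating case would be declared in the ARFF header like this (the attribute name is just for illustration):

@attribute rating {1,2,3,4,5}

and a numeric value such as weight could be binned with Weka's unsupervised Discretize filter, roughly like this (the paths, bin count and -R column range are assumptions you would adapt to your own data):

java -cp /weka/weka.jar weka.filters.unsupervised.attribute.Discretize -B 5 -R last -i /home/test/weights.arff -o /home/test/weights-binned.arff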



Answer 2:

The word vectors could be converted to binary like this:

java -Xmx4G -cp /weka/weka.jar weka.filters.unsupervised.attribute.NumericToBinary -i /home/test/cats-vector.arff -o /home/test/cats-binary.arff

Note, though, that this adds a bias to the data you are training on: values that are numerically close to one another are treated as more similar than values that are far apart. If you want to remove this bias and regard each class as a totally distinct entity, declare the class attribute as nominal, e.g. @attribute class {ABC, DEF, GHI, etc}. Then it works!
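If you would rather not edit the ARFF file by hand, Weka's NumericToNominal filter can convert a numeric attribute to a nominal one; a minimal sketch, assuming the class is the last attribute and reusing the paths from the question:

java -Xmx4G -cp /weka/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -R last -i /home/test/cats-vector.arff -o /home/test/cats-nominal.arff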

If you really want to communicate that these categories are distinct and not at all related, create a separate column for each one, holding the value 1 when a row belongs to that category and 0 when it does not. This produces very sparse data, but it lets the learning algorithm scan each indicator column for information gain.
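A rough sketch of that layout as ARFF attributes (the names are placeholders for illustration):

@attribute is_ABC {0,1}
@attribute is_DEF {0,1}
@attribute is_GHI {0,1}

Weka's unsupervised NominalToBinary filter can generate indicator attributes like these from a nominal attribute, if you would rather not build them by hand.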