I used Weka Explorer:
- Loaded the arff file
- Applied StringToWordVector filter
- Selected IBk as the best classifier
- Generated/Saved my_model.model binary
In my Java code I deserialize the model:
URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );
Now I have the classifier, but I also need the information about the filter. Where I am stuck is: how do I prepare an instance so that my deserialized model can classify it, i.e. how do I apply the filter before classification? (The raw instance that I have to classify has a field text with tokens in it. The filter was supposed to transform that into a list of new attributes.)
I even tried to use a FilteredClassifier where I set the classifier to the deserialized one and the filter to a manually created instance of StringToWordVector:
final StringToWordVector filter = new StringToWordVector();
filter.setOptions(new String[]{"-C", "-P", "x_", "-L"}); // each flag and its value must be separate array elements
FilteredClassifier fcls = new FilteredClassifier();
fcls.setFilter(filter);
fcls.setClassifier(cls);
The above does not work either. It throws the exception:
Exception in thread "main" java.lang.NullPointerException: No output instance format defined
What I am trying to avoid is doing the training in the Java code. It can be very slow, I will probably have multiple classifiers to train (with different algorithms as well), and I want my app to start fast.
Your problem is that your model doesn't know anything about what the filter did to the data. The StringToWordVector filter transforms the data, but how it does so depends on the input (training) data: the word dictionary, and therefore the set of generated attributes, is built from the training documents. A model trained on this transformed data set will only work on data that underwent the exact same transformation. To guarantee this, the filter needs to be part of your model.
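To make the dependency on the training data concrete, here is a minimal plain-Java sketch. It is not Weka code: ToyWordVector is a hypothetical stand-in that only mimics the idea of a dictionary learned from the data it was fitted on. The same document gets a different vector, with different columns, depending on which data the dictionary came from.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a word-vector filter (hypothetical, NOT Weka code). */
class ToyWordVector {
    // word -> column index, learned from the documents passed to fit()
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    /** Learns the dictionary from the given documents (analogous to the training pass). */
    void fit(List<String> documents) {
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    dictionary.putIfAbsent(token, dictionary.size());
                }
            }
        }
    }

    /** Counts words using the learned dictionary; words not in the dictionary are dropped. */
    int[] transform(String document) {
        int[] counts = new int[dictionary.size()];
        for (String token : document.toLowerCase().split("\\s+")) {
            Integer column = dictionary.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        ToyWordVector trained = new ToyWordVector();
        trained.fit(Arrays.asList("good movie", "bad movie")); // columns: good, movie, bad

        ToyWordVector fresh = new ToyWordVector();
        fresh.fit(Arrays.asList("bad plot"));                  // columns: bad, plot

        String doc = "bad plot";
        System.out.println(Arrays.toString(trained.transform(doc))); // [0, 0, 1]
        System.out.println(Arrays.toString(fresh.transform(doc)));   // [1, 1]
    }
}
```

A classifier trained on the three-column layout cannot interpret the two-column vector, which is the situation a freshly created filter puts the deserialized model in.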
Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:
- Load the ARFF file
- Select FilteredClassifier as classifier
- Select StringToWordVector as filter for it
- Select IBk as classifier for the FilteredClassifier
- Generate/Save the model to my_model.binary
The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
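The idea of bundling the filter state with the classifier can be sketched in plain Java. This is an analogy under stated assumptions, not Weka's implementation: ToyTextModel is a hypothetical class that keeps a learned dictionary and a 1-nearest-neighbour classifier (in the spirit of IBk) in one Serializable object, and roundTrip stands in for saving to and loading from a model file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy analogue of a filter-plus-classifier bundle (hypothetical, NOT Weka code). */
class ToyTextModel implements Serializable {
    private static final long serialVersionUID = 1L;

    private final Map<String, Integer> dictionary = new LinkedHashMap<>(); // filter state
    private final List<int[]> vectors = new ArrayList<>();                 // transformed training docs
    private final List<String> labels = new ArrayList<>();                 // their class labels

    /** Learns the dictionary first, then stores the transformed training documents. */
    void train(List<String> texts, List<String> classes) {
        for (String text : texts) {
            for (String token : text.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) dictionary.putIfAbsent(token, dictionary.size());
            }
        }
        for (int i = 0; i < texts.size(); i++) {
            vectors.add(toVector(texts.get(i)));
            labels.add(classes.get(i));
        }
    }

    /** Classifies raw text with the stored dictionary and 1-NN; assumes train() was called. */
    String classify(String text) {
        int[] query = toVector(text);
        int best = 0;
        long bestDistance = Long.MAX_VALUE;
        for (int i = 0; i < vectors.size(); i++) {
            long distance = 0;
            for (int c = 0; c < query.length; c++) {
                long diff = query[c] - vectors.get(i)[c];
                distance += diff * diff; // squared Euclidean distance
            }
            if (distance < bestDistance) { bestDistance = distance; best = i; }
        }
        return labels.get(best);
    }

    /** Word counts over the learned dictionary; unknown words are ignored. */
    private int[] toVector(String text) {
        int[] counts = new int[dictionary.size()];
        for (String token : text.toLowerCase().split("\\s+")) {
            Integer column = dictionary.get(token);
            if (column != null) counts[column]++;
        }
        return counts;
    }

    /** Serializes and deserializes in memory, standing in for a save/load via a model file. */
    static ToyTextModel roundTrip(ToyTextModel model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ToyTextModel) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("in-memory serialization failed", e);
        }
    }

    public static void main(String[] args) {
        ToyTextModel model = new ToyTextModel();
        model.train(Arrays.asList("good great movie", "bad awful movie"),
                    Arrays.asList("pos", "neg"));
        ToyTextModel restored = roundTrip(model);
        System.out.println(restored.classify("a great film")); // pos
    }
}
```

Because the dictionary is serialized together with the classifier, the restored object can take raw text directly and no training is needed at application start.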
Another way to do this is to apply to your testing data the same filter that was used on the training data. I describe the whole procedure step by step below. In your case you only need to follow the steps after the loading of your serialized classifier.
- Create your training file (e.g. training.arff)
- Create Instances from training file.
Instances trainingData = ..
- Use StringToWordVector to transform your string attributes to a numeric representation:
sample code:
// useIdf, minTermFreq, minGrams, maxGrams and useStemmer are your own configuration values
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
// tell the filter the structure of the data it is going to process
filter.setInputFormat(trainingData);
- Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
- Select a classifier to create your model
sample code for the LibLINEAR classifier:
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
- Save the model
sample code:
System.out.println("Saving the model...");
// try-with-resources flushes and closes the stream even if writing fails
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path + "mymodel.model"))) {
    oos.writeObject(cls);
}
- Load the model
sample code:
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
- Use the same StringToWordVector filter instance as above for testingData. If you create a new one, initialize it with the training data first (filter.setInputFormat(trainingData); and then run trainingData through it), because the filter builds its dictionary from the first batch it processes.
This keeps the format of the training set and does not add words that are not in the training set.
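- Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
- Classify!
sample code:
for (int j = 0; j < testingData.numInstances(); j++) {
    // for a nominal class, classifyInstance returns the index of the predicted class value
    double res = myCls.classifyInstance(testingData.get(j));
    // look up the label; assumes the class index is set on testingData
    String label = testingData.classAttribute().value((int) res);
}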