I am using Weka 3.7 to classify text documents based on their content. I have a set of text files organised in folders, one folder per category.
Category A: 100 txt files
Category B: 100 txt files
...
Category X: 100 txt files
I want to predict if a document falls into one of the categories A-X, OR if it falls in the category UNRECOGNISED (for all other documents).
I am building the total set of Instances programmatically like this:
private Instances getTotalSet() {
    // Attribute 0: nominal class attribute; attribute 1: string attribute with the raw text
    ArrayList<Attribute> listOfAttributes = new ArrayList<Attribute>(2);
    Attribute classAttribute = getClassAttribute();
    listOfAttributes.add(classAttribute);
    listOfAttributes.add(new Attribute("text", (ArrayList<String>) null));

    Instances totalSet = new Instances("Rel", listOfAttributes, 2);
    // The class attribute was added first, so its index is 0 (not 1, which is the text attribute)
    totalSet.setClassIndex(0);

    // Each subfolder of 'path' is named after the category of the files it contains
    File[] classNamesFolders = new File(path).listFiles((FileFilter) FileFilterUtils.directoryFileFilter());
    for (File folder : classNamesFolders) {
        if (folder.getName().equals("UNRECOGNISED")) {
            continue;
        }
        System.out.println("Adding " + folder.getName());
        // all txt files in that subfolder
        for (File file : FileUtils.listFiles(folder.getAbsoluteFile(), new SuffixFileFilter(".txt"), DirectoryFileFilter.DIRECTORY)) {
            try {
                Instance instance = new DenseInstance(2);
                instance.setValue(listOfAttributes.get(0), folder.getName());
                instance.setValue(listOfAttributes.get(1), FileUtils.readFileToString(file.getAbsoluteFile()));
                totalSet.add(instance);
            } catch (IOException e) {
                System.out.println("Couldn't add " + file + ": " + e);
            }
        }
    }
    return totalSet;
}
I am using a RandomForest classifier in this case (but that shouldn't make a difference for my question):
RandomForest rf = new RandomForest();
rf.setNumTrees(500);
rf.setMaxDepth(25);
rf.setSeed(1);
System.out.println("Building random forest with " + rf.getNumTrees() + " trees");
rf.buildClassifier(train);
When I make a prediction, I can see which category the new document is assigned to, but how can I find out whether the document should not belong to any category? While making the prediction I can access the predicted class and the class distribution for the instance:
double pred = rf.classifyInstance(test.instance(i));
double[] dist = rf.distributionForInstance(test.instance(i));
How can I use these to decide that a document should not be recognised at all and should get the category UNRECOGNISED?
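To make the question concrete, the only idea I have so far is a naive cutoff on the largest value in the distribution. The following is a minimal sketch of that idea, not something I know to be sound: the class name `RejectionSketch`, the helper `labelOrUnrecognised`, and the cutoff `MIN_CONFIDENCE = 0.5` are all made up for illustration, and it works on a plain `double[]` such as the one returned by `distributionForInstance`, with the labels passed in as a list.

```java
import java.util.Arrays;
import java.util.List;

public class RejectionSketch {

    // Hypothetical cutoff, chosen arbitrarily; it would have to be tuned on held-out data.
    static final double MIN_CONFIDENCE = 0.5;

    /**
     * Returns the label with the highest probability in dist, or "UNRECOGNISED"
     * when that highest probability is below minConfidence.
     * dist and labels are assumed to be in the same order.
     */
    static String labelOrUnrecognised(double[] dist, List<String> labels, double minConfidence) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return dist[best] >= minConfidence ? labels.get(best) : "UNRECOGNISED";
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("A", "B", "C");
        // One clearly dominant class: the label is kept.
        System.out.println(labelOrUnrecognised(new double[] {0.8, 0.1, 0.1}, labels, MIN_CONFIDENCE));
        // No class reaches the cutoff: the document is rejected.
        System.out.println(labelOrUnrecognised(new double[] {0.4, 0.3, 0.3}, labels, MIN_CONFIDENCE));
    }
}
```

With my classifier, `dist` would be the array from `rf.distributionForInstance(test.instance(i))`. I am not sure whether a fixed cutoff like this is reliable, since a classifier can also be confidently wrong on documents unlike anything it was trained on.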