Predicting the “no class” / unrecognised class in

Posted 2019-08-18 15:18

Question:

I am using Weka 3.7 to classify text documents based on their content. I have a set of text files organised into folders, one folder per category, and every file in a folder belongs to that folder's category.

Category A: 100 txt files
Category B: 100 txt files
...
Category X: 100 txt files

I want to predict if a document falls into one of the categories A-X, OR if it falls in the category UNRECOGNISED (for all other documents).

I am building the total set of Instances programmatically like this:

private Instances getTotalSet(){
    ArrayList<Attribute> listOfAttributes = new ArrayList<Attribute>(2);

    Attribute classAttribute = getClassAttribute();
    listOfAttributes.add(classAttribute);
    // A null list of values makes "text" a string attribute rather than a nominal one
    listOfAttributes.add(new Attribute("text", (ArrayList<String>) null));

    Instances totalSet = new Instances("Rel", listOfAttributes,2);
    // The class attribute was added first, so it sits at index 0 (index 1 is the text attribute)
    totalSet.setClassIndex(0);

    File[] classNamesFolders = new File(path).listFiles((FileFilter) FileFilterUtils.directoryFileFilter());
    for(File folder: classNamesFolders){
        if(folder.getName().equals("UNRECOGNISED")){
            continue;
        }
        System.out.println("Adding "+folder.getName());

        //all txt files in that subfolder
        for(File file : FileUtils.listFiles(folder.getAbsoluteFile(), new SuffixFileFilter(".txt"), DirectoryFileFilter.DIRECTORY)){
            try {
                Instance instance = new DenseInstance(2);
                instance.setValue(listOfAttributes.get(0), folder.getName());
                instance.setValue(listOfAttributes.get(1), FileUtils.readFileToString(file.getAbsoluteFile()));

                totalSet.add(instance);
            }catch(IOException e){
                System.out.println("Couldn't add "+e);
            }
        }
    }
    return totalSet;
}

I am using a RandomForest classifier in this case (though that shouldn't make a difference for my question):

RandomForest rf = new RandomForest();
rf.setNumTrees(500);
rf.setMaxDepth(25);
rf.setSeed(1);
System.out.println("Building random forest with " + rf.getNumTrees() + " trees");
rf.buildClassifier(train);

When I make a prediction, I can see which category the new document is assigned to, but how can I find out whether the document should not belong to any category at all? While making the prediction I can access the predicted class and the class distribution for the instance:

double pred = rf.classifyInstance(test.instance(i));
double[] dist = rf.distributionForInstance(test.instance(i));

How can I use these to decide that a document should not be recognised at all and should instead get the category UNRECOGNISED?
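
To make the question concrete, here is a minimal sketch of one common heuristic: treat a prediction as UNRECOGNISED when the largest value in the distribution falls below a confidence threshold. The class name `UnrecognisedThreshold`, the methods `predictOrReject` and `labelFor`, and the threshold of 0.6 are illustrative assumptions, not part of Weka. The sketch works on a plain `double[]`, so it only assumes that the array comes from `rf.distributionForInstance(...)`.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch: reject a prediction when the classifier is not confident enough.
 * All names and the threshold value are hypothetical and need tuning.
 */
class UnrecognisedThreshold {

    static final String UNRECOGNISED = "UNRECOGNISED";

    /** Returns the index of the largest probability, or -1 if it is below minConfidence. */
    static int predictOrReject(double[] dist, double minConfidence) {
        int best = -1;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > bestProb) {
                bestProb = dist[i];
                best = i;
            }
        }
        return (best >= 0 && bestProb >= minConfidence) ? best : -1;
    }

    /** Maps a distribution to a class name, falling back to UNRECOGNISED. */
    static String labelFor(double[] dist, List<String> classNames, double minConfidence) {
        int idx = predictOrReject(dist, minConfidence);
        return idx < 0 ? UNRECOGNISED : classNames.get(idx);
    }

    public static void main(String[] args) {
        List<String> classes = Arrays.asList("A", "B", "C");
        // In the real code, dist would come from rf.distributionForInstance(test.instance(i)).
        System.out.println(labelFor(new double[] {0.80, 0.15, 0.05}, classes, 0.6)); // prints "A"
        System.out.println(labelFor(new double[] {0.40, 0.35, 0.25}, classes, 0.6)); // prints "UNRECOGNISED"
    }
}
```

Whether such a threshold separates known from unknown documents well depends on the data. It would have to be tuned on held-out documents, including some that belong to none of the categories, and with many classes the largest probability can be low even for correct predictions.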