Classifying data with Naive Bayes using LingPipe

Posted 2019-04-09 18:07

Question:

I want to classify data into different classes based on its content. I did this with a naive Bayes classifier, and the output is the best category for each item. Now I also want news that does not resemble anything in the training set to be classified into an "others" class. I can't manually collect training data for everything outside my known classes, since it spans a vast number of other categories. Is there any way to classify such data as "others"?

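import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

// Fields and main() of the classifier class (class declaration omitted in the post).
private static final File TRAINING_DIR = new File("4news-train");
private static final File TESTING_DIR = new File("4news-test");
private static final String[] CATEGORIES = { "c1", "c2", "c3", "others" };

// Maximum character n-gram length used by each per-category language model.
private static final int NGRAM_SIZE = 6;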

public static void main(String[] args) throws ClassNotFoundException, IOException {
    DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
    for (int i = 0; i < CATEGORIES.length; ++i) {
        File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
        if (!classDir.isDirectory()) {
            String msg = "Could not find training directory=" + classDir + "\nTraining directory not found";
            System.out.println(msg); // in case exception gets lost in shell
            throw new IllegalArgumentException(msg);
        }

        String[] trainingFiles = classDir.list();
        for (int j = 0; j < trainingFiles.length; ++j) {
            File file = new File(classDir, trainingFiles[j]);
            String text = Files.readFromFile(file, "ISO-8859-1");
            System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
            Classification classification = new Classification(CATEGORIES[i]);
            Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
            classifier.handle(classified);
        }
    }
}

Answer 1:

Just serialize the object: write the trained classifier to a file, and that file is your model.

Then, for testing, you only need to load the model and pass the data to it. There is no need to retrain each time, which makes things much easier.
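As a rough illustration of the save-once, load-many idea, here is a minimal sketch that uses only standard Java serialization. `ToyModel` and `ModelStore` are hypothetical names made up for this example and are not LingPipe classes; a real LingPipe classifier would be written out with the library's own utilities, which are not shown here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a trained model (assumption: not a LingPipe class). */
final class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    // Number of training documents seen per category.
    final Map<String, Integer> docCountsByCategory = new HashMap<>();
}

/** Hypothetical helper that writes a model to disk and reads it back. */
final class ModelStore {
    private ModelStore() {}

    /** Writes the model to the given file, replacing any existing content. */
    static void save(Serializable model, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back whatever object was previously written by save(). */
    static Object load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class is not on the classpath", e);
        }
    }
}
```

The training program would call `ModelStore.save(model, file)` once, and the testing program would call `ModelStore.load(file)` and cast the result back to the model type.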



Answer 2:

Naive Bayes gives you the "confidence" in each classification, as it computes

P(y|x) ~ P(y)P(x|y)

Once normalized by P(x), this is the probability that x belongs to class y. You can simply apply a cutoff to this value and say that

cl(x) = "other" iff max_{over y}(P(y|x)) < T

where T can be, for example, the minimum confidence that the classifier assigns to the correct label on the training set:

T = min_{over (x, y) in training set}( P(y|x) ), with y the true label of x
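The rule above could be sketched roughly as follows. This is a standalone illustration, not LingPipe code: `RejectOption` and its method names are made up for the example, and it assumes you can obtain, for each document, the per-class log joint score log P(y) + log P(x|y) from your classifier. It also assumes the model is trained only on the known classes, with "others" assigned by the cutoff rather than learned.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: naive Bayes posteriors with a reject threshold. */
final class RejectOption {
    private RejectOption() {}

    /** Turns log P(y) + log P(x|y) per class into P(y|x), using log-sum-exp for stability. */
    static Map<String, Double> fromLogJoint(Map<String, Double> logJoint) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logJoint.values()) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        for (double v : logJoint.values()) {
            sum += Math.exp(v - max);
        }
        double logEvidence = max + Math.log(sum); // log P(x)
        Map<String, Double> posterior = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logJoint.entrySet()) {
            posterior.put(e.getKey(), Math.exp(e.getValue() - logEvidence));
        }
        return posterior;
    }

    /** cl(x): the most probable class, or "others" if its posterior is below the threshold. */
    static String classify(Map<String, Double> posterior, double threshold) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : posterior.entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return (best == null || bestP < threshold) ? "others" : best;
    }

    /** T: the smallest posterior given to the true label over all training examples. */
    static double minTrainingConfidence(List<Map<String, Double>> posteriors,
                                        List<String> trueLabels) {
        double t = 1.0;
        for (int i = 0; i < posteriors.size(); i++) {
            double p = posteriors.get(i).getOrDefault(trueLabels.get(i), 0.0);
            t = Math.min(t, p);
        }
        return t;
    }
}
```

A single misclassified training example can pull this T very low, so in practice it may be worth tuning the cutoff on held-out data that includes some out-of-class documents.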