OpenNLP document categorizer always returns the first category

I'm testing the OpenNLP library to automate content categorization, but I've run into a problem. With the code below, the categorizer always returns the first category in my training data, even though I'm passing it a full article from a news site.

    public void trainModel() {
        try {
            InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory( new File("C:\\Users\\emehm\\Desktop\\data\\training_data.txt") );
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, TrainingParameters.defaultParams(), new DoccatFactory());
            DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
            double[] outcomes = myCategorizer.categorize(  new String[]{ this.getFileContent() });
            String category = myCategorizer.getBestCategory(outcomes);
            Map<String, Double> map = myCategorizer.scoreMap(new String[]{ this.getFileContent() });
            System.out.println(category);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        }
    }

    public String getFileContent() throws IOException {
        InputStream is = new FileInputStream("C:\\Users\\emehm\\Desktop\\data\\statija.txt");
        BufferedReader buf = new BufferedReader(new InputStreamReader(is));
        String line = buf.readLine();
        StringBuilder sb = new StringBuilder();
        while (line != null) {
            sb.append(line).append("\n");
            line = buf.readLine();
        }
        buf.close();
        return sb.toString();
    }

Training data: http://pastebin.com/ZhxswkvJ

Article I'm using: http://pastebin.com/xtABGcbh

It always returns the first category in the list, and I'd like to know what I'm missing. When I debug it, every category gets a score of 0.25, and the first one is picked. It seems to work when I test it with a single word, but not with a whole article.

Tags: machine-learning nlp opennlp

1 Answer

我欲成王，谁敢阻挡

Post #2 · 2019-04-11 02:21

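The input needs to be divided into individual words (tokens), i.e. split on spaces. categorize() expects an array of tokens; when the whole article is passed as a single element, that one "token" most likely matches nothing seen in training, so every category gets the same score (0.25 here) and the first one is returned.

Change this:

    double[] outcomes = myCategorizer.categorize(new String[]{ this.getFileContent() });

to this:

    double[] outcomes = myCategorizer.categorize(this.getFileContent().split(" "));

After that, you should get better results. Note that how well the categorizer works still depends on the quality of the model.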
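One caveat with splitting on a single space: getFileContent() joins lines with "\n", so the last word of one line and the first word of the next stay glued together in one token. A minimal sketch of a slightly more robust split, using only the standard library, is below. The class name ArticleTokens and its tokenize method are hypothetical helpers for illustration, not part of OpenNLP.

```java
import java.util.Arrays;

// Hypothetical helper (not part of OpenNLP): turns raw article text into the
// token array that DocumentCategorizerME.categorize(String[]) expects.
final class ArticleTokens {

    // Splits on any run of whitespace (spaces, tabs, newlines), so tokens
    // never carry the "\n" that getFileContent() appends to each line.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // split() on an empty string would return one empty token
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String article = "Stocks fell\nsharply  today \n";
        System.out.println(Arrays.toString(tokenize(article)));
        // In the question's code this would replace the single-element array:
        // double[] outcomes = myCategorizer.categorize(tokenize(this.getFileContent()));
    }
}
```

OpenNLP also ships its own tokenizers (for example WhitespaceTokenizer.INSTANCE.tokenize(text)), which may be a better fit if you want inference to tokenize the same way the training samples were tokenized.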
