I'm testing the OpenNLP library to automate content categorization, but I'm having trouble. With the code below, the categorizer always returns the first category in my training data when I pass it a full article from a news site.
    public void trainModel() {
        try {
            InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
                    new File("C:\\Users\\emehm\\Desktop\\data\\training_data.txt"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream,
                    TrainingParameters.defaultParams(), new DoccatFactory());
            DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
            double[] outcomes = myCategorizer.categorize( new String[]{ this.getFileContent() });
            String category = myCategorizer.getBestCategory(outcomes);
            Map<String, Double> map = myCategorizer.scoreMap(new String[]{ this.getFileContent() });
            System.out.println(category);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        }
    }
    public String getFileContent() throws IOException {
        // try-with-resources closes the reader even if reading fails;
        // the charset is explicit so it matches the UTF-8 training data
        try (BufferedReader buf = new BufferedReader(new InputStreamReader(
                new FileInputStream("C:\\Users\\emehm\\Desktop\\data\\statija.txt"), "UTF-8"))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = buf.readLine()) != null) {
                sb.append(line).append("\n");
            }
            return sb.toString();
        }
    }
Training data: http://pastebin.com/ZhxswkvJ
Article I'm using: http://pastebin.com/xtABGcbh
It always returns the first category from the list, and I want to know what I'm missing. When I debug it, every category gets a score of 0.25 and the first one is picked. When I test with a single word it seems to work, but it does not work with a full article.
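The input needs to be divided into individual words (tokens), i.e. split on whitespace. categorize expects an array of tokens; when the whole article is passed as a single array element, it is treated as one giant token that the model never saw during training, which is consistent with every category getting the same 0.25 score.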
Change this:
    double[] outcomes = myCategorizer.categorize( new String[]{ this.getFileContent() });
to this (splitting on any run of whitespace, because getFileContent() joins lines with "\n"):
    double[] outcomes = myCategorizer.categorize( this.getFileContent().split("\\s+") );
After this change you should get better results. Note that accuracy still depends on the quality of the model, and therefore on the training data.
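To make the difference concrete, here is a minimal, self-contained sketch that does not use OpenNLP at all. It only shows how many tokens each form of input contains, and how a tie between equal scores can resolve to the first entry. The class TokenizeDemo, the helpers tokenize and indexOfBest, and the sample text are hypothetical names made up for illustration; indexOfBest is an assumed "first maximum wins" rule that matches the behaviour described in the question, not OpenNLP's actual code.

```java
import java.util.Arrays;

public class TokenizeDemo {

    // Splits on any run of whitespace (spaces, tabs, newlines), after trimming
    // so that leading whitespace does not produce an empty first token.
    public static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Index of the largest score. On a tie the earliest index wins, because
    // only a strictly greater score replaces the current best.
    // (Assumed rule for illustration; not taken from the OpenNLP sources.)
    public static int indexOfBest(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the text returned by getFileContent().
        String article = "Stocks rallied today\nas markets reopened";

        String[] wholeArticle = new String[] { article }; // what the question passes
        String[] tokens = tokenize(article);              // one entry per word

        System.out.println(wholeArticle.length);          // 1
        System.out.println(tokens.length);                // 6
        System.out.println(Arrays.toString(tokens));

        // Four equal scores, as reported in the question: index 0 is chosen.
        System.out.println(indexOfBest(new double[] { 0.25, 0.25, 0.25, 0.25 })); // 0
    }
}
```

With the real categorizer, the token array would be passed to categorize in place of the single-element array. If OpenNLP's own tokenizers are available in your project, using the same tokenizer for categorization as for training is probably the safer choice.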