Writing our own models in openNLP

Posted 2020-03-30 04:50

If I run a command like this on the command line:

./opennlp TokenNameFinder en-ner-person.bin "input.txt" "output.txt"

I get person names printed in output.txt, but I want to write my own models so that I can extract my own entities.

E.g.

  1. what is the risk value on icm2500.
  2. Delivery of prd_234 will be arrived late.
  3. Watson is handling router_34.

If I pass in these lines, it should parse them and extract the product entities. icm2500, prd_234, router_34, etc. are all products (we can save this information in a file and use it as a kind of lookup for the models or OpenNLP).

Can anyone please tell me how to do this?

1 answer
萌系小妹纸
#2 · 2020-03-30 05:26

You'll need to train your own model by annotating some sentences in the OpenNLP training format. For the example sentences you posted, the format would look like this:

what is the risk value on <START:product> icm2500 <END>.
Delivery of <START:product> prd_234 <END> will be arrived late.
Watson is handling <START:product> router_34 <END>.
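
If the product names are already saved in a lookup file, as the question suggests, annotated lines like the ones above could be generated rather than typed by hand. The following is only a minimal sketch under that assumption: `ProductAnnotator` and `annotate` are hypothetical names, not part of OpenNLP, and the code simply mirrors the spacing and punctuation placement of the three example lines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: wraps known product tokens in
// <START:product> ... <END> markers, mirroring the example lines above.
class ProductAnnotator {

    // products is assumed to hold lower-case product names from a lookup file.
    static String annotate(String sentence, Set<String> products) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            // Split trailing punctuation such as "." or "," off the token.
            String core = token.replaceAll("[.,;:!?]+$", "");
            String tail = token.substring(core.length());
            if (out.length() > 0) {
                out.append(' ');
            }
            if (products.contains(core.toLowerCase(Locale.ROOT))) {
                out.append("<START:product> ").append(core).append(" <END>").append(tail);
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> products =
            new HashSet<>(Arrays.asList("icm2500", "prd_234", "router_34"));
        // One annotated sentence per output line.
        System.out.println(annotate("what is the risk value on icm2500.", products));
        System.out.println(annotate("Delivery of prd_234 will be arrived late.", products));
        System.out.println(annotate("Watson is handling router_34.", products));
    }
}
```

A model trained only on sentences generated from a fixed list will mostly learn that list, so it is worth including varied sentences and product names.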

Make sure each sentence ends with a newline, and if a sentence itself contains newlines, escape or remove them so that every sentence stays on a single line. Once you have built a file like this from your data, you can use the Java API to train the model like this:

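import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class TrainProductModel {

  // Written against the OpenNLP 1.5.x API; later releases changed the train() signature.
  public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("your file in the above format"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    TokenNameFinderModel model;
    try {
      // "product" matches the <START:product> tag used in the training file.
      // The cast selects the AdaptiveFeatureGenerator overload; a bare null is ambiguous.
      model = NameFinderME.train("en", "product", sampleStream, TrainingParameters.defaultParams(),
          (AdaptiveFeatureGenerator) null, Collections.<String, Object>emptyMap());
    } finally {
      sampleStream.close();
    }

    // Placeholder output path; choose your own.
    String modelFile = "en-ner-product.bin";
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null) {
        modelOut.close();
      }
    }
  }
}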

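Now you can use the model with the name finder, for example by passing the saved model file to the TokenNameFinder command in place of en-ner-person.bin.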

Because you may have a definitive, and possibly short, list of product names, you might consider a simple regex approach.
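
As a rough illustration of that regex idea, the sketch below builds one alternation pattern from a known product list and scans each line for matches. The class and method names are hypothetical. It assumes a non-empty list whose names begin and end with a letter, digit, or underscore (so that `\b` word boundaries behave as expected), and it matches case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lookup-based finder; no OpenNLP model involved.
class ProductRegexFinder {

    // Builds \b(?:name1|name2|...)\b from a non-empty list of product names.
    static Pattern compile(List<String> names) {
        StringBuilder alternatives = new StringBuilder();
        for (String name : names) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // Pattern.quote keeps any regex metacharacters in a name literal.
            alternatives.append(Pattern.quote(name));
        }
        return Pattern.compile("\\b(?:" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Returns every product mention in the text, in order of appearance.
    static List<String> findProducts(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern pattern = compile(Arrays.asList("icm2500", "prd_234", "router_34"));
        System.out.println(findProducts(pattern, "Watson is handling router_34."));
    }
}
```

This only finds names that are already on the list; a trained model is the better fit when new, unseen product names need to be recognized from context.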

Here are the OpenNLP docs that cover training the NameFinder:

http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.tool