How to create a custom model using OpenNLP?

Posted 2020-06-23 08:33

Question:

I am trying to extract entities such as names and skills from a document using the OpenNLP Java API, but it is not extracting proper names. I am using the models available on the OpenNLP SourceForge page.

Here is a piece of Java code:

import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class tikaOpenIntro {

    public static void main(String[] args) throws IOException, SAXException,
            TikaException {

        tikaOpenIntro toi = new tikaOpenIntro();
        toi.filest("");               // clear the output file
        String cnt = toi.contentEx(); // extract plain text from the PDF with Tika
        toi.sentenceD(cnt);           // sentence boundary detection
        toi.tokenization(cnt);        // tokenization of the whole text

        String names = toi.namefind(toi.Tokens); // named entity (person) extraction
        toi.files(names);                        // append the extracted names to the output file

    }

    public String Tokens[];

    public String contentEx() throws IOException, SAXException, TikaException {
        InputStream is = new BufferedInputStream(new FileInputStream(new File(
                "/home/rahul/Downloads/rahul.pdf")));
        // URL url=new URL("http://in.linkedin.com/in/rahulkulhari");
        // InputStream is=url.openStream();
        Parser ps = new AutoDetectParser(); // auto-detects the parser for the document type (PDF here)

        BodyContentHandler bch = new BodyContentHandler();

        ps.parse(is, bch, new Metadata(), new ParseContext());

        return bch.toString();

    }

    public void files(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                true);
        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st + "\n");
        bufferWritter.close();
    }

    public void filest(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                false);

        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st);
        bufferWritter.close();
    }

    public String namefind(String cnt[]) {
        InputStream is;
        TokenNameFinderModel tnf;
        NameFinderME nf;
        String sd = "";
        try {
            is = new FileInputStream(
                    "/home/rahul/opennlp/model/en-ner-person.bin");
            tnf = new TokenNameFinderModel(is);
            nf = new NameFinderME(tnf);

            Span sp[] = nf.find(cnt);

            String a[] = Span.spansToStrings(sp, cnt);
            StringBuilder fd = new StringBuilder();
            int l = a.length;

            for (int j = 0; j < l; j++) {
                fd = fd.append(a[j] + "\n");

            }
            sd = fd.toString();

        } catch (FileNotFoundException e) {

            e.printStackTrace();
        } catch (InvalidFormatException e) {

            e.printStackTrace();
        } catch (IOException e) {

            e.printStackTrace();
        }
        return sd;
    }


    public void sentenceD(String content) {
        String cnt[] = null;
        InputStream om;
        SentenceModel sm;
        SentenceDetectorME sdm;
        try {
            om = new FileInputStream("/home/rahul/opennlp/model/en-sent.bin");
            sm = new SentenceModel(om);
            sdm = new SentenceDetectorME(sm);
            // NOTE: the detected sentences stay in this local variable and are
            // never used by the rest of the pipeline
            cnt = sdm.sentDetect(content);

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public void tokenization(String tokens) {

        InputStream is;
        TokenizerModel tm;

        try {
            is = new FileInputStream("/home/rahul/opennlp/model/en-token.bin");
            tm = new TokenizerModel(is);
            Tokenizer tz = new TokenizerME(tm);
            Tokens = tz.tokenize(tokens);
            // System.out.println(Tokens[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

What I am trying to do is:

  • I am using Apache Tika to convert the PDF document into a plain text document.
  • I pass the plain text document to sentence boundary detection.
  • After this, tokenization.
  • After this, named entity extraction (sketched below).
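
In outline, the intended flow looks like the sketch below (same model files and imports as the class above; note that the OpenNLP documentation runs the name finder over one tokenized sentence at a time, which is what this sketch does, whereas my class currently tokenizes the whole document in a single call):

    // Condensed sketch: sentence detection -> tokenization -> person name finding
    public static void findNames(String content) throws IOException {
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(
                new FileInputStream("/home/rahul/opennlp/model/en-sent.bin")));
        Tokenizer tokenizer = new TokenizerME(new TokenizerModel(
                new FileInputStream("/home/rahul/opennlp/model/en-token.bin")));
        NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
                new FileInputStream("/home/rahul/opennlp/model/en-ner-person.bin")));

        for (String sentence : sentenceDetector.sentDetect(content)) {
            String[] tokens = tokenizer.tokenize(sentence); // tokenize one sentence
            Span[] spans = nameFinder.find(tokens);         // find person name spans
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }
        }
        nameFinder.clearAdaptiveData(); // reset the adaptive data after each document
    }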

But it is extracting names mixed in with other words; it is not extracting the proper names correctly. And how do I create a custom model to extract skills such as Swimming, Programming, etc. from a document?

Please give me some ideas!

Any help will be greatly appreciated!

Answer 1:

It sounds like you're not happy with the performance of the pre-built name model for OpenNLP. But (a) models are never perfect, and even the best model will miss some things it should have caught and catch some things it should have missed; and (b) the model will perform best if the documents the model was trained on match the documents you're trying to tag, in genre and text style (so a model trained on mixed case text won't work very well on all-caps text, and a model trained on news articles won't work well on, say, tweets). You can try other publicly available tools, like the Stanford NE toolkit, or LingPipe; they may have better-performing models. But none of them are going to be perfect.
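
If you want to try Stanford NER, a minimal sketch of tagging the extracted text would look roughly like this (the classifier path is the 3-class model bundled with the Stanford NER download; treat the path and the exact API as assumptions about your installation, since they vary a bit between releases):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordNerSketch {

    public static void main(String[] args) {
        // Pre-trained 3-class (PERSON / ORGANIZATION / LOCATION) model from the Stanford NER download
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String text = "Rahul Kulhari has a Ph.D. in operations research.";

        // Wraps each recognized entity in inline tags such as <PERSON>...</PERSON>
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}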

To create a custom model, you'll need to produce some training data. For OpenNLP, it would look something like this:

I have a Ph.D. in <START:skill> operations research <END>

For something as specific as this, you'd probably need to come up with that data yourself. And you'll need a lot of it; the OpenNLP documentation recommends about 15,000 example sentences. Consult the OpenNLP docs for more details.
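
Once you have a training file in that format (one annotated sentence per line), you can train and serialize a model through the OpenNLP API. A rough sketch, assuming OpenNLP 1.5.x (the API the SourceForge-era models target; later releases changed these method signatures) and a hypothetical training file named en-ner-skill.train:

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class SkillModelTrainer {

    public static void main(String[] args) throws Exception {
        Charset charset = Charset.forName("UTF-8");

        // en-ner-skill.train: one sentence per line, annotated with <START:skill> ... <END>
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("en-ner-skill.train"), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TokenNameFinderModel model;
        try {
            // "skill" becomes the entity type reported by the name finder
            model = NameFinderME.train("en", "skill", sampleStream,
                    TrainingParameters.defaultParams(),
                    (AdaptiveFeatureGenerator) null,
                    Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }

        // Serialize the model so it can be loaded later with new TokenNameFinderModel(...)
        OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-ner-skill.bin"));
        try {
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
    }
}

The resulting en-ner-skill.bin can then be loaded exactly like en-ner-person.bin in your code, and the finder will return spans of type "skill". The same training can also be done from the command line with OpenNLP's TokenNameFinderTrainer tool.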



Answer 2:

This post might help:

OpenNLP: foreign names does not get recognized

It shows how to generate a model using a very new OpenNLP addon called "modelbuilder-addon".

You feed it a file of sentences and a file of known names, and tell it where to put the model. HTH.



Answer 3:

One way you could do this would be to keep a list of known proper names that can appear in documents. This would also be a good method for skills. When you recognize a named entity, you should check whether it appears on the list.
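
For example, a minimal sketch of such a lookup in plain Java (skills.txt is a hypothetical dictionary file with one known skill or name per line):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SkillLookup {

    private final Set<String> knownSkills = new HashSet<>();

    // Load the dictionary file, normalizing each entry to lower case
    public SkillLookup(String dictionaryPath) throws IOException {
        for (String line : Files.readAllLines(Paths.get(dictionaryPath), StandardCharsets.UTF_8)) {
            String skill = line.trim().toLowerCase(Locale.ROOT);
            if (!skill.isEmpty()) {
                knownSkills.add(skill);
            }
        }
    }

    // Check a candidate token or extracted entity against the list
    public boolean isKnownSkill(String candidate) {
        return knownSkills.contains(candidate.trim().toLowerCase(Locale.ROOT));
    }
}

Each extracted entity (or candidate phrase) would then be run through isKnownSkill before you accept it.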

The other way would be to write your own component for extracting named entities that does a better job than OpenNLP, but that is probably much more difficult.



Answer 4:

I have heard of people having good success with Apache UIMA for NER. There was a discussion about this just a day back here: how to use Entity Recognition with Apache solr and LingPipe or similar tools

It has a few links you might want to have a look at.