Which settings should be used for TokensregexNER

Published 2019-05-10 19:30

Question:

When I try regexner, it works as expected with the following settings and data:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");

Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE

What I would like to do is the same thing using TokensRegex. For example:

Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE

I read that to do this, I should use TokensregexNERAnnotator.

I tried to use it as follows, but it did not work.

Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));

I also tried setting up the annotator another way:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");    
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");

I tried different TokensRegex formats, but either the annotator could not find the expression or I got a SyntaxException.

What is the proper way to use TokensRegex (queries on tokens with tags) in an NER data file?

By the way, I just noticed a comment in the TokensRegexNERAnnotator.java file. I am not sure whether it is related, but it may mean that POS tags do not work with TokensRegexNERAnnotator:

if (entry.tokensRegex != null) {
    // TODO: posTagPatterns...
    pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}

Answer 1:

First you need to make a TokensRegex rule file (sample_degree.rules). Here is an example:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }

To explain the rule a bit: the pattern field specifies the token sequence to match, here the words "Bachelor" and "of" followed by any token whose POS tag is NNP. The action field annotates every token in the overall match (that is what $0 represents). Its second parameter names the field to set, ner, which is defined by the ner = ... line at the top of the rule file, and its third parameter is the value to store in that field, the String "DEGREE".
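To make the semantics of the rule concrete, here is a minimal plain-Java sketch of what it does to a tagged sentence. This is not CoreNLP code and not how TokensRegex is implemented: the Token class, the applyDegreeRule method, and the hand-assigned POS tags are all hypothetical, made up for illustration (a real tagger may assign different tags).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the rule's effect; this is NOT CoreNLP code.
class DegreeRuleSketch {

    // Hypothetical stand-in for a CoreNLP token: word, POS tag, NER label.
    static final class Token {
        final String word;
        final String tag;
        String ner = "O"; // "O" means "no entity"

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Mirrors: pattern (/Bachelor/ /of/ [{tag:NNP}]),
    //          action  Annotate($0, ner, "DEGREE")
    static void applyDegreeRule(List<Token> tokens) {
        for (int i = 0; i + 2 < tokens.size(); i++) {
            boolean matches = tokens.get(i).word.equals("Bachelor")
                    && tokens.get(i + 1).word.equals("of")
                    && tokens.get(i + 2).tag.equals("NNP");
            if (matches) {
                // $0 is the whole match, so all three tokens get the label.
                for (int j = i; j <= i + 2; j++) {
                    tokens.get(j).ner = "DEGREE";
                }
                i += 2; // resume scanning after the matched span
            }
        }
    }

    // Sample sentence with hand-assigned (assumed) POS tags.
    static List<Token> sampleSentence() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("She", "PRP"));
        tokens.add(new Token("holds", "VBZ"));
        tokens.add(new Token("a", "DT"));
        tokens.add(new Token("Bachelor", "NNP"));
        tokens.add(new Token("of", "IN"));
        tokens.add(new Token("Laws", "NNP"));
        tokens.add(new Token(".", "."));
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleSentence();
        applyDegreeRule(tokens);
        for (Token t : tokens) {
            System.out.println(t.word + "\t" + t.tag + "\t" + t.ner);
        }
    }
}
```

The point of the sketch is only that the match is made on token attributes (word text for the first two tokens, POS tag for the third) and that the label is written to every token in the matched span, not just the last one.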

Then make this .props file (degree_example.props) for the command:

customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator

tokensregex.rules = sample_degree.rules

annotators = tokenize,ssplit,pos,lemma,ner,tokensregex

Then run this command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text

You should see that the three tokens you wanted are now tagged as "DEGREE".
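If you would rather configure the pipeline from Java, as the question does with props.setProperty, the same three settings can be put into a java.util.Properties object. The sketch below only builds and prints the properties so that it runs without CoreNLP installed; the class and method names are made up for illustration, and the commented-out line shows where the properties would be handed to the pipeline, assuming CoreNLP is on the classpath and sample_degree.rules is in the working directory.

```java
import java.util.Properties;

// Builds the settings from degree_example.props in code (illustrative names).
class DegreePipelineProps {

    static Properties degreeProps() {
        Properties props = new Properties();
        // Register TokensRegexAnnotator under the name "tokensregex".
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the annotator at the rule file.
        props.setProperty("tokensregex.rules", "sample_degree.rules");
        // tokensregex runs last, after pos and ner have set their annotations.
        props.setProperty("annotators",
                "tokenize,ssplit,pos,lemma,ner,tokensregex");
        return props;
    }

    public static void main(String[] args) {
        Properties props = degreeProps();
        // With CoreNLP on the classpath, the pipeline would be created with:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Note that the annotator name in the annotators list has to match the suffix of the customAnnotatorClass key exactly ("tokensregex" in both places).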

I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator. But for now you need to add that line in the .props file.

This example should help in implementing this. Here are some more resources if you want to learn more:

http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html