I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a scala repl session showing how I'm trying to set up an annotation pipeline.
@ import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
@ import edu.stanford.nlp.util.PropertiesUtils.asProperties
@ val pipe = new StanfordCoreNLP(asProperties(
"customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
"annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
"ner.combinationMode", "HIGH_RECALL",
"tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051
@ val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
@ pipe.annotate(doc)
@ doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
@ doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way to tell from the document whether a tagged span is two adjacent TokensRegex matches or a single match? (Assume I have a more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens.)
I tried approaching it using the regexner with a tokens regex, as described here: https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in Scala, I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. We are also aiming to release Stanford CoreNLP 3.9.2 fairly soon, and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore, there is a detailed write-up on TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules. You could run a command like this:
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity mentions step runs, it will operate on the rule-extracted named entities and create entity mentions from them.
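For the Scala setup in the question, the same idea can be sketched as pipeline properties rather than a command line. This is a minimal sketch, not a verified configuration: the helper name nerWithRulesProps is hypothetical, and the property name ner.additional.tokensregex.rules is an assumption about how the ner annotator's TokensRegex sub-annotator is configured in the version with these changes, so check it against the NER documentation linked above.

```scala
import java.util.Properties

// Hypothetical helper: builds pipeline properties that hand a TokensRegex
// rules file to the ner annotator, instead of running TokensRegex as a
// separate custom annotator after ner has finished.
// ASSUMPTION: "ner.additional.tokensregex.rules" is the property name in
// the CoreNLP version you are running; verify it against the NER docs.
def nerWithRulesProps(rulesFile: String): Properties = {
  val props = new Properties()
  // No trailing "tokensregex" annotator: the rules run inside ner,
  // before its final entity mentions step.
  props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
  props.setProperty("ner.additional.tokensregex.rules", rulesFile)
  props
}

val props = nerWithRulesProps("my_ner.rules")
```

Passing props to new StanfordCoreNLP(props), in place of the asProperties(...) call from the question, should then let the rule-tagged span reach the entity mentions step, provided the installed CoreNLP version includes the changes described above.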