I have a rule file for TokensRegex:
$EDU_FIRST_KEYWORD = (/Education/|/Course[s]?/|/Educational/|/Academic/|/Education/ /and/?|/Professional/|/Certification[s]?/ /and/?)
$EDU_LAST_KEYWORD = (/Background/|/Qualification[s]?/|/Training[s]?/|/Detail[s]?/|/Record[s]?/)
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ ruleType: "tokens", pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?),
result: "EDUCATION"
}
I want to match EDU_FIRST_KEYWORD followed by EDU_LAST_KEYWORD. If both parts do not match together, the rule should then check whether EDU_FIRST_KEYWORD alone matches the given string.
E.g.
1. Training & Courses
Matched Output: EDUCATION (it matched Courses, which should not happen)
Expected Output: no output
No output is expected because neither the first part of the string nor the complete string matches.
2. Educational Background
Matched Output: EDUCATION
Expected Output: EDUCATION
I tried changing pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?) to pattern: ( $EDU_FIRST_KEYWORD + $EDU_LAST_KEYWORD ?), but it does not help.
I went through the Stanford NLP TokensRegex documentation but could not work out how to achieve this. Can somebody help me change the rule file? Thanks in advance.
You want to use the matches() method of TokenSequenceMatcher to have your rule run against the entire string. If you use find(), it will search for the pattern anywhere within the string; if you use matches(), it will check whether the entire string matches the pattern. At this time I am not sure if the TokensRegexAnnotator can perform full string matches on sentences, so you probably need to use some code like this:
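Before wiring this into TokensRegex, the find()/matches() difference itself is easy to check with plain java.util.regex, whose Matcher has the same two methods. The sketch below is only a character-level analogy, not TokensRegex code: the class name FindVsMatches, the helper names, and the abbreviated keyword lists are made up for illustration, and it works on characters rather than tokens.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // Character-level stand-in for "$EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?"
    // (keyword lists shortened; this is an assumption, not the original rule).
    static final Pattern HEADER = Pattern.compile(
        "(Education|Educational|Academic|Courses?)( (Background|Qualifications?|Details?))?");

    // find(): true if the pattern occurs anywhere inside the text.
    static boolean foundAnywhere(String text) {
        return HEADER.matcher(text).find();
    }

    // matches(): true only if the pattern covers the text from start to end.
    static boolean matchesWhole(String text) {
        return HEADER.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Training & Courses", "Educational Background", "Courses"};
        for (String s : samples) {
            System.out.println(s + " -> find=" + foundAnywhere(s)
                + ", matches=" + matchesWhole(s));
        }
    }
}
```

With these samples, "Training & Courses" is reported by find() (it locates Courses inside the string) but rejected by matches(), while "Educational Background" and "Courses" are accepted by both. That is the behaviour asked for in the question.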