How to modify a TokensRegex rule in Stanford NLP?

Posted 2019-09-14 15:12

I have a rule file for TokensRegex as follows:

$EDU_FIRST_KEYWORD = (/Education/|/Course[s]?/|/Educational/|/Academic/|/Education/ /and/?|/Professional/|/Certification[s]?/ /and/?)

$EDU_LAST_KEYWORD = (/Background/|/Qualification[s]?/|/Training[s]?/|/Detail[s]?/|/Record[s]?/)

tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ ruleType: "tokens", pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?), result: "EDUCATION" }

I want to match $EDU_FIRST_KEYWORD followed by $EDU_LAST_KEYWORD. If the two parts do not match together, the rule should then check whether $EDU_FIRST_KEYWORD alone matches the given string.

E.g. 1. Training & Courses

Matched Output: EDUCATION (as it matched Courses, which should not happen)

Expected Output: no output

No output is expected here because neither the first part of the string nor the complete string matches.

2. Educational Background

Matched Output: EDUCATION

Expected Output: EDUCATION

I tried changing pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?) to pattern: ( $EDU_FIRST_KEYWORD + $EDU_LAST_KEYWORD ?) but it does not help.

I read the Stanford NLP TokensRegex documentation but could not work out how to achieve this. Can somebody help me change the rule file? Thanks in advance.

1 answer
贪生不怕死
#2 · 2019-09-14 16:18

You want to use the matches() method of TokenSequenceMatcher to have your rule run against the entire String.

If you use find(), it looks for the pattern anywhere inside the token sequence, which is why "Courses" on its own produces a match. If you use matches(), it checks whether the entire token sequence matches the pattern.
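The same distinction exists in java.util.regex, whose Matcher also has find() and matches(). Here is a minimal sketch that uses plain strings instead of tokens. The class name FindVersusMatches and the simplified character-level pattern are made up for illustration and are not part of CoreNLP:

```java
import java.util.regex.Pattern;

public class FindVersusMatches {

    // Hypothetical character-level stand-in for the token-level rule:
    // a first keyword, optionally followed by a last keyword.
    static final Pattern EDU = Pattern.compile(
        "(Education|Educational|Courses)( (Background|Qualification))?");

    /** True if the pattern occurs anywhere in the text (find semantics). */
    static boolean foundAnywhere(String text) {
        return EDU.matcher(text).find();
    }

    /** True only if the pattern covers the whole text (matches semantics). */
    static boolean matchesWhole(String text) {
        return EDU.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String text : new String[] {"Training & Courses", "Educational Background"}) {
            System.out.println(text + " -> find: " + foundAnywhere(text)
                + ", matches: " + matchesWhole(text));
        }
    }
}
```

Running it prints "Training & Courses -> find: true, matches: false" and "Educational Background -> find: true, matches: true", which corresponds to the unwanted and the expected behaviour described in the question.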

At this time I am not sure if the TokensRegexAnnotator can perform full string matches on sentences, so you probably need to use some code like this:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokensRegexExactMatch {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Training & Courses");
    pipeline.annotate(annotation);
    //System.err.println(IOUtils.stringFromFile("course.rules"));
    Env env = TokenSequencePattern.getNewEnv();
    env.bind("$EDU_WORD_ONE", "/Education|Educational|Courses/");
    env.bind("$EDU_WORD_TWO", "/Background|Qualification/");
    TokenSequencePattern pattern = TokenSequencePattern.compile(env, "$EDU_WORD_ONE $EDU_WORD_TWO?");
    List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
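    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
    // matches() succeeds only if the pattern covers the whole token sequence;
    // find() would also accept a partial match such as "Courses" on its own.
    if (matcher.matches()) {
      System.err.println("---");
      String matchedString = matcher.group();
      System.err.println(matchedString);
      // Wildcard type so this compiles regardless of the exact generic return type.
      List<? extends CoreMap> matchedTokens = matcher.groupNodes();
      System.err.println(matchedTokens);
    } else {
      System.err.println("no match");
    }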
  }
}