How to modify TokenRegex rule in StanfordNLP?

I have a TokensRegex rule file like this:

$EDU_LAST_KEYWORD = (/Background/|/Qualification[s]?/|/Training[s]?/|/Detail[s]?/|/Record[s]?/)

tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ ruleType: "tokens", pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?), result: "EDUCATION" }

I want to match $EDU_FIRST_KEYWORD followed by $EDU_LAST_KEYWORD. If the string does not match both parts, the rule should then check whether $EDU_FIRST_KEYWORD alone matches the given string.

E.g. 1. Training & Courses

Matched Output: EDUCATION (as it matched Courses, which should not happen)

Expected Output: no output

There should be no output here because the string matches neither the first keyword on its own nor the complete pattern.

2. Educational Background

Matched Output: EDUCATION

Expected Output: EDUCATION

I tried changing pattern: ( $EDU_FIRST_KEYWORD $EDU_LAST_KEYWORD ?) to pattern: ( $EDU_FIRST_KEYWORD + $EDU_LAST_KEYWORD ?) but it does not help.

I read the Stanford NLP TokensRegex documentation but could not work out how to achieve this. Can somebody help me change the rule file? Thanks in advance.

Tags: java stanford-nlp

1 Answer

贪生不怕死

Post #2 · 2019-09-14 16:18

You want to use the matches() method of TokenSequenceMatcher to have your rule run against the entire String.

If you use find(), it searches for the pattern anywhere in the token sequence, so a match on part of the string is enough. If you use matches(), it succeeds only when the entire token sequence matches the pattern.
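The same find()/matches() distinction exists in java.util.regex.Matcher, so it can be tried out without CoreNLP on the classpath. The sketch below is only a character-level analogy: the class name FindVsMatches and the plain string regex are illustrative stand-ins for the token pattern, not part of TokensRegex.

```java
import java.util.regex.Pattern;

public class FindVsMatches {

  // Character-level stand-in (an assumption for illustration) for the token
  // pattern "$EDU_WORD_ONE $EDU_WORD_TWO?" used in this answer.
  static final Pattern EDU = Pattern.compile(
      "(Education|Educational|Courses)( (Background|Qualification))?");

  // find(): true if the pattern occurs anywhere in the input.
  static boolean foundAnywhere(String text) {
    return EDU.matcher(text).find();
  }

  // matches(): true only if the pattern covers the whole input.
  static boolean matchesWhole(String text) {
    return EDU.matcher(text).matches();
  }

  public static void main(String[] args) {
    for (String text : new String[] {"Training & Courses", "Educational Background"}) {
      System.out.println(text + " -> find: " + foundAnywhere(text)
          + ", matches: " + matchesWhole(text));
    }
  }
}
```

For "Training & Courses", find is true (it hits "Courses") while matches is false; for "Educational Background", both are true. That is the behavior the question asks for once matches() is used.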

At this time I am not sure if the TokensRegexAnnotator can perform full string matches on sentences, so you probably need to use some code like this:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokensRegexExactMatch {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Training & Courses");
    pipeline.annotate(annotation);
    //System.err.println(IOUtils.stringFromFile("course.rules"));
    Env env = TokenSequencePattern.getNewEnv();
    env.bind("$EDU_WORD_ONE", "/Education|Educational|Courses/");
    env.bind("$EDU_WORD_TWO", "/Background|Qualification/");
    TokenSequencePattern pattern = TokenSequencePattern.compile(env, "$EDU_WORD_ONE $EDU_WORD_TWO?");
    List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
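    // matches() succeeds only if the pattern covers every token; find() would
    // also accept a partial hit such as "Courses" inside "Training & Courses".
    if (matcher.matches()) {
      System.err.println("---");
      List<? extends CoreMap> matchedTokens = matcher.groupNodes();
      System.err.println(matchedTokens);
    } else {
      System.err.println("No full match");
    }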
  }
}
