Stanford CoreNLP - split words ignoring apostrophe

I'm trying to split a sentence into words using Stanford CoreNLP. I'm having a problem with words that contain an apostrophe.

For example, the sentence: I'm 24 years old.

It is split like this: [I] ['m] [24] [years] [old]

Is it possible to split it like this instead, using Stanford CoreNLP: [I'm] [24] [years] [old]

I've tried the tokenize.whitespace option, but then it no longer splits on punctuation marks such as '?' and ','.

Tags: nlp stanford-nlp

3 Answers

戒情不戒烟

#2 · 2020-02-13 03:48

Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (regarding "I'm" as a reduced form of "I am" by making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.

While there are some tokenization options, there isn't one to change this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer, changing (deleting) the treatment of REDAUX and SREDAUX trailing contexts.

You can also join contractions in post-processing, as @dhg suggests, but you'd want to make the "if" condition a little more careful so that it doesn't join on quotes.
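A more careful join could be sketched roughly as below. This is only an illustration, not part of CoreNLP: it works on a plain list of token strings (as you would collect from the tokenizer), the class name ContractionJoiner and the clitic list are my own assumptions, and the list may need extending for your data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ContractionJoiner {
    // Assumed list of clitic tokens that Penn Treebank tokenization splits off.
    private static final Set<String> CLITICS = new HashSet<>(Arrays.asList(
            "'m", "'s", "'re", "'ve", "'d", "'ll", "n't"));

    // Re-attaches a clitic to the previous token, but only if that token
    // ends in a letter or digit. A bare quote (') is never in CLITICS,
    // so quotation marks stay separate tokens.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            boolean attach = false;
            if (CLITICS.contains(tok.toLowerCase()) && !out.isEmpty()) {
                String prev = out.get(out.size() - 1);
                attach = Character.isLetterOrDigit(prev.charAt(prev.length() - 1));
            }
            if (attach) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + tok);
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [I'm, 24, years, old, .]
        System.out.println(join(Arrays.asList("I", "'m", "24", "years", "old", ".")));
        // prints [He, said, ', don't, ', .]
        System.out.println(join(Arrays.asList("He", "said", "'", "do", "n't", "'", ".")));
    }
}
```

Matching against a fixed set rather than "starts with an apostrophe" is what keeps quotes from being glued on, and it also picks up n't, which does not start with an apostrophe.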


聊天终结者

#3 · 2020-02-13 03:48

How about if you just re-concatenate tokens that are split by an apostrophe?

Here's an implementation in Java:

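import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// Tokenizes with PTBTokenizer, then glues any token that starts with an
// apostrophe ('m, 's, 're, ...) back onto the token before it.
// Note: n't (as in do + n't) does not start with an apostrophe, so it is not rejoined.
public static List<String> tokenize(String s) {
    PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
            new StringReader(s), new CoreLabelTokenFactory(), "");
    List<String> sentence = new ArrayList<>();
    StringBuilder sb = new StringBuilder();  // token currently being assembled
    while (ptbt.hasNext()) {
        String word = ptbt.next().word();
        if (word.startsWith("'")) {
            // Contraction or possessive suffix: append to the current token.
            // This also matches a standalone single quote.
            sb.append(word);
        } else {
            // Ordinary token: flush the previous one and start a new one.
            if (sb.length() > 0) {
                sentence.add(sb.toString());
            }
            sb = new StringBuilder(word);
        }
    }
    if (sb.length() > 0) {
        sentence.add(sb.toString());
    }
    return sentence;
}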

public static void main(String[] args) {
    System.out.println(tokenize("I'm 24 years old."));  // [I'm, 24, years, old, .]
}


够拽才男人

#4 · 2020-02-13 03:54

There are possessives and contractions. Your example is a contraction. Just looking for an apostrophe won't tell you the difference between the two. Take "This is Pete's answer. I'm sure you knew that." These two sentences contain one of each case.

With the part-of-speech tags we can tell the difference. With the Tsurgeon ("tree surgeon") syntax you can reassemble those tokens, change them and so forth. The syntax is listed here: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/package-summary.html. I've found Tsurgeon to be really useful for pulling apart NP groups, as I like to break them up over conjunctions.
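To make the tag-based distinction concrete, here is a small sketch. It assumes you already have the tokens and their Penn Treebank tags from a tagger; the tags below are written by hand for illustration (POS is the Treebank tag for a possessive ending, while contracted verbs get verb or modal tags such as VBP, VBZ or MD). The class name ApostropheKinds is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ApostropheKinds {
    // Labels apostrophe-initial tokens by their Penn Treebank tag:
    // POS -> possessive ending, VB* or MD -> contracted verb.
    // Anything else (for example quotation marks) is skipped.
    static List<String> classify(List<String> words, List<String> tags) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String tag = tags.get(i);
            if (!word.startsWith("'")) {
                continue;
            }
            if (tag.equals("POS")) {
                out.add(word + "/POSSESSIVE");
            } else if (tag.startsWith("VB") || tag.equals("MD")) {
                out.add(word + "/CONTRACTION");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written tags, standing in for real tagger output.
        List<String> words = Arrays.asList("Pete", "'s", "answer", ".", "I", "'m", "sure", ".");
        List<String> tags = Arrays.asList("NNP", "POS", "NN", ".", "PRP", "VBP", "JJ", ".");
        // prints ['s/POSSESSIVE, 'm/CONTRACTION]
        System.out.println(classify(words, tags));
    }
}
```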

Alternatively, does 'm stem to "am"? You might want to look for those tokens, look up their stems, and simply revert each one to that value. Stemming is extremely useful in many other aspects of machine learning and analysis.
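If a stemmer or lemmatizer doesn't give you the form you want, a plain lookup table is one way to sketch the "revert it" idea. The table below is my own assumption, not something CoreNLP provides, and it deliberately leaves out 's and 'd because they are ambiguous ("is"/"has"/possessive, "would"/"had"). The class name CliticExpander is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CliticExpander {
    // Hypothetical table of unambiguous clitics and their full forms.
    private static final Map<String, String> FULL_FORMS = new HashMap<>();
    static {
        FULL_FORMS.put("'m", "am");
        FULL_FORMS.put("'re", "are");
        FULL_FORMS.put("'ve", "have");
        FULL_FORMS.put("'ll", "will");
        FULL_FORMS.put("n't", "not");
    }

    // Replaces each known clitic token with its full form; other tokens pass through.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            out.add(FULL_FORMS.getOrDefault(tok.toLowerCase(), tok));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [I, am, 24, years, old, .]
        System.out.println(expand(Arrays.asList("I", "'m", "24", "years", "old", ".")));
    }
}
```

This only rewrites the clitic itself, so a pair like [ca] [n't] would come out as [ca] [not]; the stem of the preceding token still has to be handled separately.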

