I'm trying to split a sentence into words using Stanford CoreNLP.
I'm having a problem with words that contain an apostrophe.
For example, the sentence:
I'm 24 years old.
is split like this:
[I] ['m] [24] [years] [old]
Is it possible to split it like this using Stanford CoreNLP?
[I'm] [24] [years] [old]
I've tried using tokenize.whitespace, but then it doesn't split on other punctuation marks such as '?' and ','.
Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (regarding "I'm" as a reduced form of "I am" by making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.
While there are some tokenization options, there isn't one to change this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer, changing (deleting) the treatment of REDAUX and SREDAUX trailing contexts.
You can also join contractions via post-processing as @dhg suggests, but you'd want to be a little more careful in the "if" so that it doesn't also join on quote tokens.
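A more careful version of that check could look like the sketch below. It works on a plain list of token strings (so it does not depend on CoreNLP itself) and only joins tokens that appear in a small set of clitics. The class name ContractionJoiner and the contents of the clitic set are assumptions made for illustration, not part of CoreNLP, and the set is not exhaustive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of CoreNLP.
class ContractionJoiner {

    // Clitics that PTB-style tokenization splits off the preceding word.
    // Illustrative list only; extend it as needed.
    private static final Set<String> CLITICS = new HashSet<>(Arrays.asList(
            "'m", "'s", "'re", "'ve", "'ll", "'d", "n't"));

    // Re-attach known clitics to the previous token; leave quotes and
    // everything else untouched.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            boolean isClitic = CLITICS.contains(tok.toLowerCase(Locale.ROOT));
            if (isClitic && !out.isEmpty() && endsWithLetter(out.get(out.size() - 1))) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + tok);
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    // Only join onto something that looks like a word, not onto punctuation.
    private static boolean endsWithLetter(String s) {
        return !s.isEmpty() && Character.isLetter(s.charAt(s.length() - 1));
    }
}
```

With the tokens [I] ['m] [24] [years] [old] [.] this gives [I'm] [24] [years] [old] [.], while a standalone quote token such as ['] is left as its own token because it is not in the clitic set.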
How about if you just re-concatenate tokens that are split by an apostrophe?
Here's an implementation in Java:
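import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// The class name is arbitrary; it is only here so the snippet compiles.
public class ApostropheTokenizer {

    public static List<String> tokenize(String s) {
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
                new StringReader(s), new CoreLabelTokenFactory(), "");
        List<String> sentence = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        while (ptbt.hasNext()) {
            String word = ptbt.next().word();
            if (word.startsWith("'")) {
                // Token begins with an apostrophe: glue it onto the pending word.
                // Note that this also matches quote tokens.
                sb.append(word);
            } else {
                // Any other token starts a new word; flush the pending one first.
                if (sb.length() > 0) {
                    sentence.add(sb.toString());
                }
                sb = new StringBuilder(word);
            }
        }
        // Flush the last pending word.
        if (sb.length() > 0) {
            sentence.add(sb.toString());
        }
        return sentence;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I'm 24 years old.")); // [I'm, 24, years, old, .]
    }
}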
There are possessives and contractions. Your example is a contraction. Just looking for an apostrophe won't tell you the difference between the two. "This is Pete's answer. I'm sure you knew that." In these two sentences we have one of each case.
With the part-of-speech tags we can tell the difference. With the Tsurgeon ("tree surgeon") syntax you can reassemble those tokens, change them and so forth. The syntax is listed here: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/package-summary.html. I've found Tsurgeon to be really useful for pulling apart NP groups, as I like to break them up over conjunctions.
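As a rough sketch of the part-of-speech idea: in the Penn Treebank tag set the possessive ending is tagged POS, while split-off contractions normally carry a verb tag (VB*), MD, or RB for n't. The helper below assumes you already have each token together with its tag from a tagger; the class name ApostropheKind and its method are made up for illustration and are not a CoreNLP API.

```java
// Hypothetical helper, not part of CoreNLP.
class ApostropheKind {

    enum Kind { POSSESSIVE, CONTRACTION, OTHER }

    // Classify a token using its Penn Treebank POS tag.
    static Kind classify(String token, String posTag) {
        boolean splitOff = token.startsWith("'") || token.equalsIgnoreCase("n't");
        if (!splitOff) {
            return Kind.OTHER;
        }
        if (posTag.equals("POS")) {
            return Kind.POSSESSIVE;            // e.g. the 's in "Pete's answer"
        }
        if (posTag.startsWith("VB") || posTag.equals("MD") || posTag.equals("RB")) {
            return Kind.CONTRACTION;           // e.g. 'm, 'll, n't
        }
        return Kind.OTHER;                     // quote tokens and anything unexpected
    }
}
```

For the example sentences, classify("'s", "POS") would report a possessive and classify("'m", "VBP") a contraction, assuming the tagger assigns those tags.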
Alternatively, does 'm stem to "am"? You might want to look for those tokens, look up their stem, and simply revert each one to that value. Stemming is extremely useful in many other aspects of machine learning and analysis.
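If a stemmer or lemmatizer does not give you the form you want, a small hand-written lookup is one fallback. The sketch below replaces only clitics whose full form is unambiguous; 's and 'd are left alone because they can stand for more than one word. The class name CliticExpander and the table are assumptions for illustration, not something CoreNLP provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of CoreNLP.
class CliticExpander {

    // Illustrative table of unambiguous clitics and their full forms.
    private static final Map<String, String> FULL_FORMS = new HashMap<>();
    static {
        FULL_FORMS.put("'m", "am");
        FULL_FORMS.put("'re", "are");
        FULL_FORMS.put("'ve", "have");
        FULL_FORMS.put("'ll", "will");
        FULL_FORMS.put("n't", "not");
    }

    // Replace each known clitic token with its full form; keep other tokens as-is.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String tok : tokens) {
            out.add(FULL_FORMS.getOrDefault(tok.toLowerCase(Locale.ROOT), tok));
        }
        return out;
    }
}
```

Here [I] ['m] [24] becomes [I] [am] [24]. The word before n't is not repaired, so [ca] [n't] would come out as [ca] [not] and needs separate handling.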