Does anyone know of a Java library that finds sentence boundaries? I'm picturing something like a smarter StringTokenizer that knows about the sentence terminators used in different languages.
Here's my experience with BreakIterator:
Using the example here: I have the following Japanese:
In ASCII, it looks like this:
Here's the part of that sample that I changed:
static void sentenceExamples() {
    Locale currentLocale = new Locale("ja", "JP");
    BreakIterator sentenceIterator =
    String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
When I print the boundary indices, I see this:
But those indices don't correspond to any sentence terminators.
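For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the check described above. It assumes the sentence iterator comes from `BreakIterator.getSentenceInstance` and that the source file is compiled as UTF-8. The class name `SentenceBoundaryDemo` and the helper `sentenceBoundaries` are hypothetical names for illustration, not part of the sample I modified.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not taken from the original sample.
class SentenceBoundaryDemo {

    // Collects every boundary offset the sentence iterator reports for the text.
    // The first offset is always 0 and the last is always text.length().
    static List<Integer> sentenceBoundaries(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Assumption: this file is saved and compiled as UTF-8.
        String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
        List<Integer> boundaries = sentenceBoundaries(someText, Locale.JAPAN);
        System.out.println("Boundaries: " + boundaries);

        // Print the text between consecutive boundaries, one segment per line,
        // so it is easy to see whether each segment ends at a terminator.
        for (int i = 1; i < boundaries.size(); i++) {
            System.out.println(someText.substring(boundaries.get(i - 1), boundaries.get(i)));
        }
    }
}
```

Printing the substring between each pair of boundaries, rather than only the raw indices, makes it easier to compare what the iterator treats as a sentence against where the 。 and ! characters actually sit.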