可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here: I have the following Japanese:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

In ascii, it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed: static void sentenceExamples() {

  Locale currentLocale = new Locale ("ja","JP");
  BreakIterator sentenceIterator = 
     BreakIterator.getSentenceInstance(currentLocale);
  String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";

When I look at the Boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.

回答1:

You wrote:

I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on the context, consider:

How did Dr. Jones compute 5! without recursion?

This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.

So this is a more complex problem than one might think in the beginning. It can be approached using machine learning techniques. You could for instance look into the OpenNLP project, in particular the SentenceDetectorME class.