Splitting text into sentences and sentences into words:

Posted 2019-06-19 12:06

Question:

I accidentally answered a question where the original problem involved splitting a sentence into separate words.

The author suggested using BreakIterator to tokenize input strings, and some people liked this idea.

I just don't get that madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?

Please explain to me the pros of using BreakIterator and the real cases where it should be used.

If it's really so cool and proper, then I wonder: do you really use the BreakIterator approach in your projects?

Answer 1:

Looking at the code posted in that answer, it seems BreakIterator takes the language and locale of the text into consideration. Getting that level of support via regex would surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
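
To make the locale point concrete, here is a minimal sketch of word splitting with `java.text.BreakIterator`. The class name `LocaleAwareWords` and the helper `words` are my own, purely for illustration. The locale is simply passed to the factory method; how much the rules actually differ between locales depends on the JDK, and this sketch makes no claim about that.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleAwareWords {
    // Returns the word tokens of text, using the word-break rules chosen for the given locale.
    public static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // The iterator also reports spaces and punctuation as segments; keep only real words.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```

For `"Hello, world!"` this yields `Hello` and `world`, whereas a naive `text.split(" ")` leaves the punctuation attached (`Hello,` and `world!`).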



Answer 2:

The BreakIterator gives some nice explicit control and iterates cleanly in a nested way over each sentence and word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure it's quite helpful sometimes as well.

It didn't strike me as complicated at all. Just set up one iterator for the sentence level and another for the word level, then nest the word one inside the sentence one.

If the problem changed into something different, the solution you had on the other question might just go out the window. However, that pattern of iterating through sentences and words can do a lot.
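
Roughly, the nesting looks like the sketch below. This is not the code from the other answer, just a minimal version under my own assumptions; `NestedSplit` and `split` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NestedSplit {
    // Splits text into sentences, then each sentence into its words.
    public static List<List<String>> split(String text, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        BreakIterator words = BreakIterator.getWordInstance(locale);
        List<List<String>> result = new ArrayList<>();

        sentences.setText(text);
        int sStart = sentences.first();
        for (int sEnd = sentences.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentences.next()) {
            String sentence = text.substring(sStart, sEnd);
            List<String> sentenceWords = new ArrayList<>();

            // Inner loop: the word iterator walks over the current sentence only.
            words.setText(sentence);
            int wStart = words.first();
            for (int wEnd = words.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = words.next()) {
                String token = sentence.substring(wStart, wEnd);
                // Skip whitespace and punctuation segments.
                if (Character.isLetterOrDigit(token.codePointAt(0))) {
                    sentenceWords.add(token);
                }
            }
            // Ignore segments that contain no words at all.
            if (!sentenceWords.isEmpty()) {
                result.add(sentenceWords);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello world. How are you?", Locale.ENGLISH));
    }
}
```

The result is one list of words per sentence, which is a convenient shape for any per-sentence processing.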

  1. Find the sentence in which any single word is repeated the most times, and output it along with that word.
  2. Find the word used the most times throughout the whole string.
  3. Find all words that occur in every sentence.
  4. Find all words that occur a prime number of times in 2 or more sentences.

The list goes on...
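
As a rough illustration of tasks 2 and 3, here is a sketch that assumes the text has already been split into one list of words per sentence, by whatever means. `SentenceStats` and its methods are hypothetical names of mine, and comparing words case-insensitively is my own assumption, not something the tasks specify.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SentenceStats {
    // Task 2: the word used most often across all sentences, or null if there are no words.
    // On a tie, the word that reached the top count first wins.
    public static String mostFrequentWord(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                String key = word.toLowerCase(Locale.ROOT);
                int n = counts.merge(key, 1, Integer::sum);
                if (best == null || n > counts.get(best)) {
                    best = key;
                }
            }
        }
        return best;
    }

    // Task 3: the words that occur in every sentence (empty if there are no sentences).
    public static Set<String> wordsInEverySentence(List<List<String>> sentences) {
        Set<String> common = null;
        for (List<String> sentence : sentences) {
            Set<String> seen = new LinkedHashSet<>();
            for (String word : sentence) {
                seen.add(word.toLowerCase(Locale.ROOT));
            }
            if (common == null) {
                common = seen;           // first sentence: start with all of its words
            } else {
                common.retainAll(seen);  // later sentences: keep only the shared words
            }
        }
        return common == null ? new LinkedHashSet<>() : common;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("The", "cat", "sat"),
                List.of("the", "dog", "sat", "down"),
                List.of("Sat", "the", "bird"));
        System.out.println(mostFrequentWord(sentences));
        System.out.println(wordsInEverySentence(sentences));
    }
}
```

Once the sentences and words are available as nested lists, each of these tasks is only a few lines, which is the flexibility the iteration pattern buys you.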