I am trying to parse UTF-8 strings into "bite-sized" segments. For example, I would like to break a text down into "sentences".
Is there a comprehensive collection of characters (or a regex) that corresponds to sentence endings in all languages? I'm looking for something that would capture the Latin period, exclamation mark and question mark, the Chinese and Japanese full stop, etc.
Something like the above for the equivalent of a comma would be great too.
You need to look at code points with the \p{Sentence_Break=STerm} or \p{Sentence_Break=ATerm} properties that also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1, we learn that these code points meet all those criteria:

I haven’t encountered any compilations of such information, and I would expect it to be a major effort to collect it. For some widely used languages, you could get the information from The Chicago Manual of Style. There is some information about punctuation marks commonly used in different languages at http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters-other.html but it covers just a small set of languages and does not distinguish sentence-terminating characters.
Using just characters won’t be enough, since e.g. in English, the full stop “.” occurs in many contexts where it does not terminate a sentence, as in “e.g.” or in “1.5”.
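To make the point about context concrete, here is a rough Python sketch. The terminator set, the sample text and both function names are my own assumptions for illustration, not a complete or recommended solution. It compares cutting after every terminator character with a simple heuristic that only cuts when the terminator is followed by whitespace and an uppercase ASCII letter.

```python
import re

# Hypothetical, hand-picked terminators; this is NOT a complete Unicode list.
TERMINATORS = ".!?。！？"
_CLASS = "[" + re.escape(TERMINATORS) + "]"
_NOT_CLASS = "[^" + re.escape(TERMINATORS) + "]"

SAMPLE = "It costs 1.5 dollars, e.g. at noon. Fine!"


def naive_split(text):
    """Cut after every run of terminator characters, ignoring context."""
    pieces = re.findall(_NOT_CLASS + "+" + _CLASS + "*", text)
    return [p.strip() for p in pieces if p.strip()]


def heuristic_split(text):
    """Cut only where a terminator is followed by whitespace and an
    uppercase ASCII letter. Still wrong for cases such as "Dr. Smith"."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


print(naive_split(SAMPLE))      # breaks inside "1.5" and "e.g."
print(heuristic_split(SAMPLE))  # keeps "1.5" and "e.g." intact
```

The heuristic is deliberately crude: it assumes Latin-script capitalization, so it does nothing useful for scripts without letter case, and abbreviations followed by a capitalized word will still be split.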
Chinese, Japanese and Korean use 。 (U+3002, IDEOGRAPHIC FULL STOP). Thai uses a space. See this list of Unicode full stop equivalents.
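If you want to build a rough list of full stop equivalents yourself, one possible approach is to scan the Unicode character names that ship with Python. This is only a sketch under my own assumptions: the function name is made up, matching on the name "FULL STOP" is an approximation rather than the Sentence_Break property, and the result depends on the Unicode version bundled with your Python build.

```python
import sys
import unicodedata


def full_stop_like():
    """Return punctuation characters whose Unicode name contains 'FULL STOP'.

    Approximation only: this matches on character names, not on the
    Sentence_Break or Terminal_Punctuation properties.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        # Keep punctuation only; this drops e.g. U+2488 DIGIT ONE FULL STOP.
        if "FULL STOP" in name and unicodedata.category(ch).startswith("P"):
            found.append(ch)
    return found


for ch in full_stop_like():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The category filter is there because some numbering symbols and tag characters also carry "FULL STOP" in their names without being sentence punctuation. Terminators whose names do not mention "FULL STOP" (the Devanagari danda, for example) will not be found this way.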