How can we separate utf-8 characters into words if

Posted 2019-09-18 14:14

Question:

I am writing a program that takes a UTF-8 string and splits it into words. For Latin characters this is simple: split on spaces. For Chinese characters it is also simple: every character is a word.

What about if the strings are mixed?

What should I do?

I suppose I could detect whether each character is Chinese, that is, whether it belongs to a script that separates words with spaces or to one that does not separate them at all.

What's the standard way to do this?

Or perhaps I should split on anything that is not alphanumeric (counting letters from non-Latin scripts and accented letters as alphanumeric)? If so, how should I proceed? Is there a regex that matches anything that is not alphanumeric, where alphanumeric includes accented letters, the Hebrew alphabet, the Arabic abjad, and so on?

For example, I want to split

I like horse into

I
Like
Horse

I want to split 北小金駅南口第1自転車駐車場 into

北
小
金
駅
南
...

Because each character in Chinese is a word.

What makes this problem tricky is that word splitting works differently for Chinese and Western text. Western words are separated by spaces, while Chinese characters are not separated by anything.

I suppose we could detect whether each character is Chinese before splitting. That would be fine, but I don't know how to do that either.

Answer 1:

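Use regular expressions. A metacharacter such as \b matches the boundary between a word character and a non-word character in any script, so it handles space-separated text. Note two limitations: .NET treats every Chinese character as a word character, so \b finds no boundary between consecutive Chinese characters and does not split them individually, and the resulting array also contains the whitespace runs and possibly empty strings, which need to be filtered out.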

Regex.Split(myString, "\b", RegexOptions.None)
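
For the mixed case in the question, one possible approach is to match tokens instead of splitting on boundaries. The following is only a minimal sketch, not a standard or definitive solution. It assumes that "Chinese character" means a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which .NET exposes as \p{IsCJKUnifiedIdeographs}; ideographs in the extension blocks, and scripts such as kana or Thai that also omit spaces, are not handled. The names MixedScriptSplit, TokenPattern and SplitWords are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MixedScriptSplit
    ' First alternative: exactly one CJK unified ideograph (assumed to be one word).
    ' Second alternative: a run of letters, combining marks and digits from any
    ' script, with the CJK block removed by .NET character class subtraction.
    Private ReadOnly TokenPattern As New Regex(
        "\p{IsCJKUnifiedIdeographs}|[\p{L}\p{M}\p{N}-[\p{IsCJKUnifiedIdeographs}]]+")

    ' Returns the tokens in order; spaces and punctuation are simply skipped.
    Function SplitWords(source As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In TokenPattern.Matches(source)
            words.Add(m.Value)
        Next
        Return words
    End Function

    Sub Main()
        Dim sample As String = "I like horse 北小金駅南口第1自転車駐車場"
        Console.WriteLine(String.Join("|", SplitWords(sample)))
    End Sub
End Module
```

With the sample string above, this should yield I, like, horse, then each ideograph as its own token, with the digit 1 as a separate token between 第 and 自. Because \p{L} and \p{M} cover accented Latin, Hebrew and Arabic letters as well, words in those scripts stay together as long as they are separated by spaces or punctuation.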


Tags: vb.net utf-8