I'm reading a Unicode stream and would rather not have to pass the entire string through a regex. Is there a simple (reliable) character I can use to break words across languages?
My byte array will most likely be encoded as UTF-16 or UTF-8.
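No single character works reliably across languages: Chinese, Japanese, and Thai, for example, are normally written without spaces between words. If you are using Java, you can use java.text.BreakIterator instead; BreakIterator.getWordInstance() returns an iterator that finds locale-aware word boundaries.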
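As a rough illustration of the BreakIterator approach, here is a minimal sketch. It assumes the bytes are UTF-8; the class name WordSplitter and the method words are made up for this example and are not part of any library. Word boundaries also enclose whitespace and punctuation, so the sketch keeps only segments that contain at least one letter or digit.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only.
public class WordSplitter {

    /** Decodes the bytes as UTF-8 (an assumption) and returns the words found. */
    public static List<String> words(byte[] utf8Bytes, Locale locale) {
        // Decode first: BreakIterator works on text, not on raw bytes.
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);

        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Segments between boundaries include spaces and punctuation;
            // keep only those containing a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }
}
```

Calling WordSplitter.words(bytes, Locale.ENGLISH) on the UTF-8 bytes of "Hello, world!" should give the two words Hello and world. For UTF-16 input, decoding with StandardCharsets.UTF_16 instead should work, since that charset honors a byte-order mark when one is present. Segmentation quality for languages written without spaces depends on the JDK; ICU4J's com.ibm.icu.text.BreakIterator is an alternative worth evaluating if the built-in results are not good enough.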