I read some data from a stream in UTF-8 encoding:
String line = new String(byteArray, "UTF-8");
then try to find a subsequence:
int startPos = line.indexOf(tag) + tag.length();
int endPos = line.indexOf("/", startPos);
and cut it out:
String name = line.substring(startPos, endPos);
In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I got values like "горд��нни", "горду��ни", "г��рдунни", etc.
It looks as if surrogate pairs are randomly broken for some reason. It happened 4 times out of 1000.
How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?
In order to get this out of the 'Unanswered' queue.
The problem occurs because the stream was read in chunks of bytes, and a chunk boundary sometimes fell in the middle of a multi-byte UTF-8 character. Each half is then decoded on its own and turns into the replacement character (U+FFFD).
If you wrap the InputStream in an InputStreamReader, you read chunks of characters rather than chunks of bytes, and multi-byte UTF-8 characters survive intact.
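To make the difference concrete, here is a minimal sketch. It assumes the original code decoded each byte chunk separately, which the question does not actually show; the class name, method names, the "name=" tag and the chunk size of 8 are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the question.
class ChunkedUtf8Demo {

    // Presumed broken approach: every byte chunk is decoded on its own,
    // so a character split across two chunks is decoded as two invalid halves.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Suggested fix: InputStreamReader keeps incomplete byte sequences
    // until the rest arrives, so read() only ever returns whole chars.
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "gordunni" in Cyrillic, written with escapes so that the
        // encoding of this source file does not matter.
        String name = "\u0433\u043E\u0440\u0434\u0443\u043D\u043D\u0438";
        byte[] data = ("name=" + name + "/").getBytes(StandardCharsets.UTF_8);

        // Each Cyrillic letter is 2 bytes, so 8-byte chunks cut some in half.
        System.out.println(decodePerChunk(new ByteArrayInputStream(data), 8));
        System.out.println(decodeWithReader(new ByteArrayInputStream(data), 8));
    }
}
```

With the reader in place, indexOf() and substring() can stay exactly as they are in the question; only the way the bytes become a String changes.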
In your example, can you show the contents of byteArray, line, and tag? Can you also show the resulting length, startPos, and endPos? I mean, the string "гордунни" contains no "/"! And why do you calculate endPos? What string is in tag? Are you sure that substring's second parameter is the end position and not the length? It is true that "гордунни" needs no surrogate pairs, because all of its code points are below 0xFFFF, but once your UTF-16 string contains at least one surrogate pair, I bet the length of the string will give you the number of 16-bit code units and not the number of code points. I am not sure about Java, but in C# Length gives you the number of elements; to get the number of characters/code points you have to use the StringInfo class. Also check whether there is a BOM in your string. What is
String line = new String(byteArray, "UTF-8");
doing? Is the byte array a UTF-8 encoded string that gets converted to UTF-16? Does it contain a UTF-8 BOM? Does the string afterwards have a UTF-16LE or UTF-16BE BOM?
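For the Java side of these questions, a small sketch may help. The class and method names are hypothetical and the "name=" tag is an assumption, since the question does not show the real one. It checks three things: length() counts UTF-16 code units while codePointCount() counts code points, the second argument of substring() is an exclusive end index, and new String(bytes, UTF_8) leaves a leading UTF-8 BOM in the result as U+FEFF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class StringIndexFacts {

    // Number of UTF-16 code units; indexOf() and substring() use these indices.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the decoded text still starts with U+FEFF, i.e. a BOM was not stripped.
    static boolean startsWithBom(String s) {
        return !s.isEmpty() && s.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        // "gordunni" in Cyrillic: all letters are in the BMP, so no surrogates.
        String name = "\u0433\u043E\u0440\u0434\u0443\u043D\u043D\u0438";
        System.out.println(codeUnits(name) + " units, " + codePoints(name) + " code points");

        // One emoji outside the BMP is stored as a surrogate pair.
        String withEmoji = "a\uD83D\uDE00b";
        System.out.println(codeUnits(withEmoji) + " units, " + codePoints(withEmoji) + " code points");

        // substring(begin, end): end is an index (exclusive), not a length.
        String line = "name=" + name + "/";
        int start = line.indexOf("name=") + "name=".length();
        int end = line.indexOf("/", start);
        System.out.println(line.substring(start, end).equals(name));

        // Bytes EF BB BF are the UTF-8 BOM; the decoder keeps it as a char.
        byte[] bomBytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(startsWithBom(new String(bomBytes, StandardCharsets.UTF_8)));
    }
}
```

Since indexOf() and substring() both work in code units, they stay consistent with each other even when surrogate pairs are present, so they are unlikely to be what breaks the name here.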