Java substring broken encoding

I read some data from a stream in UTF-8 encoding:

String line = new String(byteArray, "UTF-8");

then I try to find a subsequence:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it out:

String name = line.substring(startPos, endPos);

In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc. It seems as if surrogate pairs are randomly broken for some reason. This happened 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to my result?

Tags: java utf-8 substring

2 Answers

老娘就宠你

Reply #2 · 2019-04-30 18:23

Posting this to get the question out of the 'Unanswered' queue.

The problem occurs because the stream was read as chunks of bytes, sometimes splitting multi-byte UTF-8 characters.

By wrapping the InputStream in an InputStreamReader, you will read chunks of characters (as opposed to chunks of bytes), and multi-byte UTF-8 characters will survive.
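The difference can be reproduced with a minimal sketch. It assumes the question's code decodes each byte chunk separately with new String(...), which the question does not actually show. The class name ChunkedUtf8Demo, the helper methods, the 8-byte chunk size and the "name=" tag are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ChunkedUtf8Demo {

    // Assumed shape of the failing code: every byte chunk is decoded on its own,
    // so a two-byte character that straddles a chunk boundary is decoded as garbage.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // The fix from the answer: let InputStreamReader do the decoding.
    // The reader keeps incomplete byte sequences until the remaining bytes arrive.
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "name=гордунни/" written with Unicode escapes so that the source file
        // encoding does not matter. "name=" is a hypothetical tag.
        String input = "name=\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438/";
        byte[] data = input.getBytes(StandardCharsets.UTF_8);

        // With 8-byte chunks the second Cyrillic letter is split across two chunks.
        String broken = decodePerChunk(new ByteArrayInputStream(data), 8);
        String intact = decodeWithReader(new ByteArrayInputStream(data), 8);

        System.out.println(broken.contains("\uFFFD")); // replacement characters present
        System.out.println(intact.equals(input));
    }
}
```

Once line is built from the reader's output, the indexOf() + substring() code from the question can stay as it is.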


够拽才男人

Reply #3 · 2019-04-30 18:29

In your example, can you show the contents of byteArray, line and tag? Can you also show the resulting length, startPos and endPos? I mean, the string "гордунни" contains no "/"! And why do you calculate endPos? What is the string inside tag? Are you sure substring's second parameter is the end position and not the length?

It is true that "гордунни" needs no surrogate pairs, because all of its code points are below 0xFFFF. But once your UTF-16 string contains at least one surrogate pair, I bet the length of the string will give you the number of 16-bit code units and not the number of code points. I am not sure about Java, but in C# Length gives you the number of elements; to get the number of characters/code points you have to use the StringInfo class in C#.

Also check whether there is a BOM in your string. What is

String line = new String(byteArray, "UTF-8");

doing? Is the byte array a UTF-8 encoded string that gets converted to UTF-16? Does it contain a UTF-8 BOM? Does the string afterwards have a UTF-16LE or UTF-16BE BOM?
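For the Java side of these questions, here is a small sketch of how lengths and a BOM behave. The class name CodeUnitsDemo and the stripBom helper are hypothetical and exist only for this illustration. In Java, String.length() counts UTF-16 code units, codePointCount() counts code points, and new String(bytes, UTF_8) leaves a UTF-8 BOM in the result as a leading U+FEFF.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsDemo {

    // Hypothetical helper: drops a leading U+FEFF, which is what a UTF-8 BOM decodes to.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // "гордунни" as Unicode escapes; every letter is a single UTF-16 code unit.
        String cyrillic = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        System.out.println(cyrillic.length());                             // 8
        System.out.println(cyrillic.codePointCount(0, cyrillic.length())); // 8

        // U+1F600 lies outside the BMP and needs a surrogate pair.
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1

        // UTF-8 BOM (EF BB BF) followed by "a/".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', '/'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());           // 3, the BOM is kept
        System.out.println(stripBom(decoded).length()); // 2
    }
}
```

Because indexOf() and substring() both work in the same code-unit indices, neither the lengths nor a BOM would by themselves produce replacement characters in the middle of a purely Cyrillic name.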
