UTF-8 string delimiter

Posted 2020-04-10 02:15

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. Assuming the short is unsigned, this gives a maximum string length of 2^16 - 1 = 65,535, which is more than adequate for the particular application.

My question is, is this a standard way of delimiting UTF-8 strings?

Tags: utf-8

3 answers

Lonely孤独者°

#2 · 2020-04-10 02:25

I wouldn't call that delimiting; it is more like "length prefixing". Some people call them Pascal strings, since in the early days Pascal was one of the popular languages that stored strings that way in memory.

I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
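As a rough illustration of the length-prefixing described above, here is a minimal Python sketch. It assumes the two-byte length is unsigned, big-endian, and counts bytes rather than characters; the question does not state any of these, so check them against the actual protocol. The function names `write_prefixed` and `read_prefixed` are made up for this example.

```python
import struct
from typing import Tuple


def write_prefixed(text: str) -> bytes:
    """Encode text as UTF-8, preceded by a 2-byte length (assumed big-endian, unsigned)."""
    payload = text.encode("utf-8")
    if len(payload) > 0xFFFF:
        raise ValueError("string too long for a 16-bit length prefix")
    return struct.pack(">H", len(payload)) + payload


def read_prefixed(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Decode one length-prefixed string at offset; return (text, next_offset)."""
    if offset + 2 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated string payload")
    # The prefix counts bytes, so slice first and decode afterwards.
    return buf[start:end].decode("utf-8"), end
```

Because the reader returns the offset of the next unread byte, the caller can continue parsing whatever raw bytes follow the string without scanning for a terminator.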


SAY GOODBYE

#3 · 2020-04-10 02:33

I would use a delimiter which starts with 0x11......, but if you send raw bytes you will have to exclude this delimiter from the data/messages being processed. This means that if user input happens to match the delimiter, you will have to convert (escape) it.

If the user inputs any character representable in UTF-8, you can simply send it as is.
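To make the trade-off concrete, here is a small Python sketch of delimiter-based framing. The delimiter 0xFF and escape byte 0xFE are assumptions chosen for this example only (the answer does not spell out its delimiter); they are convenient because neither byte can occur in valid UTF-8. Text can therefore be sent unchanged, while raw bytes need escaping. All names here are hypothetical.

```python
DELIM = 0xFF  # assumed delimiter; never appears in valid UTF-8
ESC = 0xFE    # assumed escape byte; also never appears in valid UTF-8


def frame_text(text: str) -> bytes:
    """UTF-8 text needs no escaping: it cannot contain DELIM or ESC."""
    return text.encode("utf-8") + bytes([DELIM])


def escape_raw(raw: bytes) -> bytes:
    """Replace DELIM and ESC inside raw data with two-byte escape sequences."""
    out = bytearray()
    for b in raw:
        if b == ESC:
            out += bytes([ESC, 0x00])
        elif b == DELIM:
            out += bytes([ESC, 0x01])
        else:
            out.append(b)
    return bytes(out)


def unescape_raw(data: bytes) -> bytes:
    """Reverse escape_raw; raises ValueError on a malformed escape."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b != ESC:
            out.append(b)
            continue
        code = next(it, None)
        if code == 0x00:
            out.append(ESC)
        elif code == 0x01:
            out.append(DELIM)
        else:
            raise ValueError("malformed escape sequence")
    return bytes(out)
```

The cost of this approach is visible in `escape_raw`: every raw payload has to be scanned and possibly grown, which a length prefix avoids.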


你好瞎i

#4 · 2020-04-10 02:50

UTF-8 is not normally delimited. You should be able to spot the multibyte characters in there by using the rules described here: http://en.wikipedia.org/wiki/UTF-8#Description
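The rules referred to above come down to inspecting the lead byte of each sequence. The following Python sketch (function names invented for illustration) splits a byte string into per-character sequences on that basis. It is deliberately not a full validator: it does not reject overlong forms or surrogate code points. It also only finds character boundaries; it does not by itself say where a string ends and raw bytes resume.

```python
def sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with this byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would be overlong)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, limited to U+10FFFF
    return 0      # continuation byte (10xxxxxx) or invalid lead


def split_characters(data: bytes) -> list:
    """Split data into a list of byte sequences, one per encoded character."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0 or i + n > len(data):
            raise ValueError("invalid or truncated sequence at offset %d" % i)
        # Every byte after the lead must look like 10xxxxxx.
        if any(b & 0xC0 != 0x80 for b in data[i + 1:i + n]):
            raise ValueError("bad continuation byte after offset %d" % i)
        chars.append(data[i:i + n])
        i += n
    return chars
```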

