UTF-8 string delimiter

Posted 2020-04-10 02:15

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. Assuming the short is unsigned, this gives a maximum string length of 2^16 - 1 = 65,535, which is more than adequate for the particular application.

My question is, is this a standard way of delimiting UTF-8 strings?

Tags: utf-8
3 answers
Lonely孤独者°
#2 · 2020-04-10 02:25

I wouldn't call that delimiting; it's more like "length prefixing". Some people call these Pascal strings, since Pascal was one of the popular early languages that stored strings that way in memory.

I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
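
To make the length-prefixing idea concrete, here is a minimal sketch in Python. It assumes the two-byte length is an unsigned big-endian value that counts bytes (not characters); the question does not state either, so adjust the format string if the protocol differs. The names `write_string` and `read_string` are made up for illustration.

```python
import struct
from typing import Tuple

MAX_LEN = 0xFFFF  # largest value an unsigned 16-bit length field can hold


def write_string(text: str) -> bytes:
    """Encode text as UTF-8 and prepend its byte length as a big-endian u16."""
    payload = text.encode("utf-8")
    if len(payload) > MAX_LEN:
        raise ValueError("string too long for a two-byte length prefix")
    return struct.pack(">H", len(payload)) + payload


def read_string(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Decode one length-prefixed string; return it and the offset just past it."""
    if offset + 2 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated string payload")
    return buf[start:end].decode("utf-8"), end


# Two raw bytes, then a string, then one more raw byte.
frame = b"\x01\x02" + write_string("héllo") + b"\xff"
text, next_offset = read_string(frame, 2)
```

Because the reader knows the byte count up front, it never has to inspect the string contents to find where the raw bytes resume.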

SAY GOODBYE
#3 · 2020-04-10 02:33

I would use a delimiter which starts with 0x11...... However, if you send raw bytes, you will have to exclude this delimiter from the data/messages being processed. This means that if user input happens to look like the delimiter, you will have to convert (escape) it.

If the user inputs any character represented in UTF-8, you may simply send it as is.
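
As a rough sketch of that conversion step, the Python below escapes raw bytes so they can never contain the delimiter. The byte values are assumptions chosen for illustration, not the delimiter this answer has in mind: `DELIM = 0xFF` and `ESC = 0xFE` are hypothetical picks, used here because neither byte can occur in valid UTF-8, so UTF-8 text passes through unchanged.

```python
DELIM = 0xFF  # hypothetical delimiter byte; never occurs in valid UTF-8
ESC = 0xFE    # hypothetical escape byte; also never occurs in valid UTF-8


def escape(raw: bytes) -> bytes:
    """Rewrite DELIM and ESC so raw bytes cannot be mistaken for a delimiter."""
    out = bytearray()
    for b in raw:
        if b in (DELIM, ESC):
            out.append(ESC)
            out.append(b ^ 0x20)  # flipped value is neither DELIM nor ESC
        else:
            out.append(b)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Reverse escape(); data must not contain an unescaped DELIM."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESC:
            try:
                out.append(next(it) ^ 0x20)  # restore the original byte
            except StopIteration:
                raise ValueError("dangling escape byte") from None
        else:
            out.append(b)
    return bytes(out)


raw = b"\x01\xff\xfe\x02"
escaped = escape(raw)
```

The cost of this approach, compared with a length prefix, is that the receiver has to scan every byte and the escaped data can grow to twice its original size in the worst case.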

你好瞎i
#4 · 2020-04-10 02:50

UTF-8 is not normally delimited. You should be able to spot the multibyte characters in it by using the rules described here: http://en.wikipedia.org/wiki/UTF-8#Description
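
Those rules come down to looking at the leading bits of each byte. The following is a minimal Python sketch; `sequence_length` and `split_characters` are illustrative names, and the input is assumed to contain only UTF-8 text with no raw protocol bytes mixed in.

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return how many bytes a character starting with this byte occupies.

    Returns 0 for a continuation byte (10xxxxxx) or a byte that cannot
    start a UTF-8 sequence.
    """
    if lead >> 7 == 0b0:      # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if lead >> 5 == 0b110:    # 110xxxxx: start of a two-byte character
        return 2
    if lead >> 4 == 0b1110:   # 1110xxxx: start of a three-byte character
        return 3
    if lead >> 3 == 0b11110:  # 11110xxx: start of a four-byte character
        return 4
    return 0


def split_characters(data: bytes) -> List[str]:
    """Split UTF-8 bytes into characters using only the lead-byte rules."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            raise ValueError(f"byte {data[i]:#04x} cannot start a character")
        # decode() also checks that the following bytes are valid continuations
        chars.append(data[i:i + n].decode("utf-8"))
        i += n
    return chars


chars = split_characters("aé€😀".encode("utf-8"))
```

Note that these rules find character boundaries inside a run of UTF-8 text; on their own they do not say where a string ends and unrelated raw bytes begin, since raw bytes can also match the same bit patterns.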
