For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a Javascript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of `u16`), because:

- I need to address 16-bit code units individually (this is a bad idea in general, but Javascript requires it). This means the string type needs to implement `Index<usize>`.
- I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of `u16` that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 that stores unpaired surrogates in UTF-8, but I don't want to use something like that.
I want to have the usual owned/borrowed pair of types (like `String`/`str` and `CString`/`CStr`) with all or most of the usual methods. I don't want to roll my own string type if I can avoid it.
Also, my strings will always be immutable, behind an `Rc`, and referred to from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have `Rc<Utf16Str>` as the string type, where `Utf16Str` is the unsized string type (which can be defined as just `struct Utf16Str([u16])`). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an `Rc` with an unsized type.
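To make the idea concrete, this is roughly the layout I have in mind (a minimal sketch; the names are just placeholders):

```rust
use std::rc::Rc;

// Unsized borrowed string type: a thin wrapper over a slice of code units.
struct Utf16Str([u16]);

// What I'd like the interned, shared string handle to be: a single
// allocation holding the code units, with no second indirection.
type JsString = Rc<Utf16Str>;

// The open question is how to actually construct an Rc<Utf16Str>,
// since Utf16Str is unsized.
```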
Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of `u8`.

Also, I'm not sure whether the standard library helps me here at all. I looked into `Utf16Units`, but it's just an iterator, not a proper string type. (I also know `OsString` doesn't help: I'm not on Windows, and it doesn't even implement `Index<usize>`.)
Since there are multiple questions here I’ll try to respond separately:
I think the types you want are `[u16]` and `Vec<u16>`.

The default string types `str` and `String` are wrappers around `[u8]` and `Vec<u8>` (not technically true of `str`, which is primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8. Similarly, you could have `Utf16Str` and `Utf16String` types wrapping `[u16]` and `Vec<u16>` that preserve the invariant of being well-formed UTF-16, namely that there is no unpaired surrogate.

But as you note in your question, JavaScript strings can contain unpaired surrogates. That’s because JavaScript strings are not strictly UTF-16; they really are arbitrary sequences of `u16` with no additional invariant. With no invariant to maintain, I don’t think wrapper types are all that useful.
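For instance, a plain `Vec<u16>` already covers the requirements listed in the question, with no wrapper needed (a minimal sketch):

```rust
fn main() {
    // A JavaScript-style string is just a sequence of u16 code units;
    // an unpaired surrogate like 0xD800 is perfectly representable.
    let s: Vec<u16> = vec![0x0068, 0x0069, 0xD800];

    // Vec<u16> already implements Index<usize>, so individual
    // code units can be addressed directly.
    assert_eq!(s[2], 0xD800);

    // The borrowed form is just a slice.
    let borrowed: &[u16] = &s;
    assert_eq!(borrowed.len(), 3);
}
```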
rust-encoding supports UTF-16-LE and UTF-16-BE, based on bytes. You probably want UTF-16 based on `u16`s instead.

`std::str::Utf16Units` is indeed not a string type. It is an iterator, returned by the `str::utf16_units()` method, that converts a Rust string to UTF-16 (not LE or BE). You can use `.collect()` on that iterator to get a `Vec<u16>`, for example.
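A minimal sketch of that conversion; note that on recent stable Rust this iterator is exposed as `str::encode_utf16()` (the stabilized form of `utf16_units()`):

```rust
fn main() {
    let s = "hé𝄞";

    // Convert the Rust (UTF-8) string to a vector of UTF-16 code units.
    let units: Vec<u16> = s.encode_utf16().collect();

    // '𝄞' (U+1D11E) needs a surrogate pair, so we get 1 + 1 + 2 units.
    assert_eq!(units.len(), 4);

    // Round-trip back to a Rust string (this only succeeds because
    // the units here happen to be well-formed UTF-16).
    assert_eq!(String::from_utf16(&units).unwrap(), s);
}
```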
The only safe way to obtain `Rc<[u16]>` is to coerce from `Rc<[u16; N]>` whose size is known at compile time, which is obviously impractical. I wouldn’t recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of `RcBox`, and transmuting. If you’re going to do it with raw memory allocation, it's better to use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs
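The safe coercion itself looks like this; it only helps when the length is a compile-time constant:

```rust
use std::rc::Rc;

fn main() {
    // The length must be known at compile time for this to work...
    let sized: Rc<[u16; 3]> = Rc::new([0x0068, 0x0069, 0xD800]);

    // ...and then Rc<[u16; 3]> coerces to the unsized Rc<[u16]>.
    let slice_rc: Rc<[u16]> = sized;

    // Deref gives &[u16], so indexing by usize works as usual.
    assert_eq!(slice_rc[2], 0xD800);
    assert_eq!(slice_rc.len(), 3);
}
```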
Or, if you’re willing to take the cost of the extra indirection, `Rc<Vec<u16>>` is safe and much easier.
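A sketch of that route, combined with the weak-pointer interning table described in the question (the `Interner` type here is just an illustration, not an existing library, and it makes no attempt to prune stale entries):

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

// One possible interner: map from code units to a weak handle,
// so the table itself does not keep strings alive.
struct Interner {
    table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
}

impl Interner {
    fn new() -> Interner {
        Interner { table: HashMap::new() }
    }

    fn intern(&mut self, units: Vec<u16>) -> Rc<Vec<u16>> {
        // Reuse an existing allocation if one is still alive.
        if let Some(weak) = self.table.get(&units) {
            if let Some(existing) = weak.upgrade() {
                return existing;
            }
        }
        // Otherwise allocate a new shared string and remember it weakly.
        let rc = Rc::new(units.clone());
        self.table.insert(units, Rc::downgrade(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern(vec![0x0068, 0x0069]);
    let b = interner.intern(vec![0x0068, 0x0069]);
    // Both handles point at the same allocation; the extra indirection
    // is the Vec's own heap pointer inside the Rc.
    assert!(Rc::ptr_eq(&a, &b));
}
```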