For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a Javascript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of `u16`), because:

- I need to address 16-bit code units individually (this is a bad idea in general, but Javascript requires it). This means the string type needs to implement `Index<usize>`.
- I need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of `u16` that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 that stores unpaired surrogates in UTF-8, but I don't want to use something like that.
I want to have the usual owned/borrowed pair of types (like `String`/`str` and `CString`/`CStr`) with all or most of the usual methods. I don't want to roll my own string type if I can avoid it.
Also, my strings will always be immutable, behind an `Rc`, and referred to from a data structure containing weak pointers to all strings (implementing string interning). This might be relevant: perhaps it would be better to have `Rc<Utf16Str>` as the string type, where `Utf16Str` is the unsized string type (which can be defined as just `struct Utf16Str([u16])`). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an `Rc` with an unsized type.
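To make the idea concrete, this is roughly the layout I have in mind (a minimal sketch; the names are just placeholders):

```rust
use std::rc::Rc;

// Unsized borrowed string type: a thin wrapper over a slice of code units.
struct Utf16Str([u16]);

// What I'd like the interned, shared string handle to be: a single
// allocation holding the code units, with no second indirection.
type JsString = Rc<Utf16Str>;

// The open question is how to actually construct an Rc<Utf16Str>,
// since Utf16Str is unsized.
```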
Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of `u8`.

Also, I'm not sure whether the standard library helps me here at all. I looked into `Utf16Units`, but it's just an iterator, not a proper string type. (I also know `OsString` doesn't help: I'm not on Windows, and it doesn't even implement `Index<usize>`.)
Since there are multiple questions here I’ll try to respond separately:
I think the types you want are `[u16]` and `Vec<u16>`.

The default string types `str` and `String` are wrappers around `[u8]` and `Vec<u8>` (not technically true of `str`, which is primitive, but close enough). The point of having separate types is to maintain the invariant that the underlying bytes are well-formed UTF-8. Similarly, you could have `Utf16Str` and `Utf16String` types wrapping `[u16]` and `Vec<u16>` that preserve the invariant of being well-formed UTF-16, namely that there is no unpaired surrogate.

But as you note in your question, JavaScript strings can contain unpaired surrogates. That’s because JavaScript strings are not strictly UTF-16; they really are arbitrary sequences of `u16` with no additional invariant. With no invariant to maintain, I don’t think wrapper types are all that useful.
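For instance, a plain `Vec<u16>` already covers the requirements listed in the question, with no wrapper needed (a minimal sketch):

```rust
fn main() {
    // A JavaScript-style string is just a sequence of u16 code units;
    // an unpaired surrogate like 0xD800 is perfectly representable.
    let s: Vec<u16> = vec![0x0068, 0x0069, 0xD800];

    // Vec<u16> already implements Index<usize>, so individual
    // code units can be addressed directly.
    assert_eq!(s[2], 0xD800);

    // The borrowed form is just a slice.
    let borrowed: &[u16] = &s;
    assert_eq!(borrowed.len(), 3);
}
```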
rust-encoding supports UTF-16-LE and UTF-16-BE, based on bytes. You probably want UTF-16 based on `u16`s instead.

`std::str::Utf16Units` is indeed not a string type. It is an iterator, returned by the `str::utf16_units()` method, that converts a Rust string to UTF-16 (not LE or BE). You can use `.collect()` on that iterator to get a `Vec<u16>`, for example.
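A minimal sketch of that conversion; note that on recent stable Rust this iterator is exposed as `str::encode_utf16()` (the stabilized form of `utf16_units()`):

```rust
fn main() {
    let s = "hé𝄞";

    // Convert the Rust (UTF-8) string to a vector of UTF-16 code units.
    let units: Vec<u16> = s.encode_utf16().collect();

    // '𝄞' (U+1D11E) needs a surrogate pair, so we get 1 + 1 + 2 units.
    assert_eq!(units.len(), 4);

    // Round-trip back to a Rust string (this only succeeds because
    // the units here happen to be well-formed UTF-16).
    assert_eq!(String::from_utf16(&units).unwrap(), s);
}
```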
The only safe way to obtain `Rc<[u16]>` is to coerce from `Rc<[u16; N]>` whose size is known at compile time, which is obviously impractical. I wouldn’t recommend the unsafe way: allocating memory, writing a header to it that hopefully matches the memory representation of `RcBox`, and transmuting. If you’re going to do it with raw memory allocation, it's better to use your own type so that you can use its private fields. Tendril does this: https://github.com/servo/tendril/blob/master/src/buf32.rs
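The safe coercion itself looks like this; it only helps when the length is a compile-time constant:

```rust
use std::rc::Rc;

fn main() {
    // The length must be known at compile time for this to work...
    let sized: Rc<[u16; 3]> = Rc::new([0x0068, 0x0069, 0xD800]);

    // ...and then Rc<[u16; 3]> coerces to the unsized Rc<[u16]>.
    let slice_rc: Rc<[u16]> = sized;

    // Deref gives &[u16], so indexing by usize works as usual.
    assert_eq!(slice_rc[2], 0xD800);
    assert_eq!(slice_rc.len(), 3);
}
```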
Or, if you’re willing to take the cost of the extra indirection, `Rc<Vec<u16>>` is safe and much easier.
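A sketch of that route, combined with the weak-pointer interning table described in the question (the `Interner` type here is just an illustration, not an existing library, and it makes no attempt to prune stale entries):

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

// One possible interner: map from code units to a weak handle,
// so the table itself does not keep strings alive.
struct Interner {
    table: HashMap<Vec<u16>, Weak<Vec<u16>>>,
}

impl Interner {
    fn new() -> Interner {
        Interner { table: HashMap::new() }
    }

    fn intern(&mut self, units: Vec<u16>) -> Rc<Vec<u16>> {
        // Reuse an existing allocation if one is still alive.
        if let Some(weak) = self.table.get(&units) {
            if let Some(existing) = weak.upgrade() {
                return existing;
            }
        }
        // Otherwise allocate a new shared string and remember it weakly.
        let rc = Rc::new(units.clone());
        self.table.insert(units, Rc::downgrade(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::new();
    let a = interner.intern(vec![0x0068, 0x0069]);
    let b = interner.intern(vec![0x0068, 0x0069]);
    // Both handles point at the same allocation; the extra indirection
    // is the Vec's own heap pointer inside the Rc.
    assert!(Rc::ptr_eq(&a, &b));
}
```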