For most programs, it's better to use UTF-8 internally and convert to other encodings only when necessary. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of `u16`), because I need to address 16-bit code units individually (this is a bad idea in general, but JavaScript requires it). This means the string type needs to implement `Index<usize>`.
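To make that requirement concrete, here is a minimal sketch (the helper name is mine, nothing standard) of the code-unit-level access that JavaScript's `charCodeAt` semantics demand; it becomes trivial once the string is just a slice of `u16`:

```rust
// Hypothetical helper: JS-style charCodeAt is O(1) indexing into UTF-16 code units.
fn char_code_at(units: &[u16], i: usize) -> Option<u16> {
    units.get(i).copied()
}

fn main() {
    // "a😀" encodes to three UTF-16 code units: 0x0061 plus a surrogate pair.
    let units: Vec<u16> = "a😀".encode_utf16().collect();
    assert_eq!(units.len(), 3);
    assert_eq!(char_code_at(&units, 1), Some(0xD83D)); // high surrogate, not a whole char
}
```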
I also need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, ECMAScript strings are technically defined as arrays of `u16` that usually represent UTF-16 strings). There is an encoding aptly named WTF-8 for storing unpaired surrogates in UTF-8, but I don't want to use something like that.
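A quick demonstration of why the standard `String` can't be the storage type here, using only std:

```rust
fn main() {
    // A lone (unpaired) high surrogate is a perfectly valid JS string value,
    // but it is not valid UTF-16, so it cannot round-trip through String.
    let lone = vec![0x0061u16, 0xD83D];
    assert!(String::from_utf16(&lone).is_err());
    // from_utf16_lossy replaces the surrogate with U+FFFD, losing information.
    assert_eq!(String::from_utf16_lossy(&lone), "a\u{FFFD}");
}
```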
I want the usual pair of owned/borrowed types (like `String`/`str` and `CString`/`CStr`) with all or most of the usual methods. I don't want to roll my own string type if I can avoid it.
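For what it's worth, here is the rough shape such a pair could take if I did roll my own (hypothetical names, just a sketch following the `String`/`str` and `PathBuf`/`Path` pattern):

```rust
use std::ops::{Deref, Index};

// Borrowed, unsized view over UTF-16 code units.
#[repr(transparent)]
pub struct Utf16Str([u16]);

// Owned counterpart.
pub struct Utf16String(Vec<u16>);

impl Utf16Str {
    // Reinterpret &[u16] as &Utf16Str; relies on #[repr(transparent)].
    pub fn from_units(units: &[u16]) -> &Utf16Str {
        unsafe { &*(units as *const [u16] as *const Utf16Str) }
    }
    pub fn len(&self) -> usize {
        self.0.len()
    }
}

impl Index<usize> for Utf16Str {
    type Output = u16;
    fn index(&self, i: usize) -> &u16 {
        &self.0[i]
    }
}

impl Deref for Utf16String {
    type Target = Utf16Str;
    fn deref(&self) -> &Utf16Str {
        Utf16Str::from_units(&self.0)
    }
}

fn main() {
    let s = Utf16String("hé".encode_utf16().collect());
    assert_eq!(s.len(), 2);   // borrowed method via Deref
    assert_eq!(s[1], 0x00E9); // Index<usize> via autoderef
}
```

Every method would then have to be written by hand, which is exactly the work I'd like to avoid.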
Also, my strings will always be immutable, behind an `Rc`, and referenced from a data structure holding weak pointers to all strings (i.e. string interning). This might be relevant: perhaps it would be better to have `Rc<Utf16Str>` as the string type, where `Utf16Str` is the unsized string type (which can be defined as just `struct Utf16Str([u16])`). That would avoid following two pointers when accessing the string, but I don't know how to instantiate an `Rc` with an unsized type.
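One construction that seems to work (hedged: it relies on `Utf16Str` being `#[repr(transparent)]` over `[u16]`, so the layouts match) is to build an `Rc<[u16]>` first, which std can do from a `Vec<u16>`, and then cast the fat pointer through `Rc::into_raw`/`Rc::from_raw`:

```rust
use std::rc::Rc;

#[repr(transparent)]
struct Utf16Str([u16]);

// Sketch: obtain Rc<Utf16Str> without a second allocation or pointer hop.
fn rc_from_units(units: Vec<u16>) -> Rc<Utf16Str> {
    // std provides From<Vec<u16>> for Rc<[u16]> (one allocation, data inline).
    let rc: Rc<[u16]> = Rc::from(units);
    // Reinterpret the fat pointer; sound only because of #[repr(transparent)].
    unsafe { Rc::from_raw(Rc::into_raw(rc) as *const Utf16Str) }
}

fn main() {
    let s: Rc<Utf16Str> = rc_from_units("hé".encode_utf16().collect());
    assert_eq!(s.0.len(), 2);
}
```

The interning table would then hold `Weak<Utf16Str>` values pointing at the same allocations.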
Given the above requirements, merely using rust-encoding is very inconvenient, because it treats all non-UTF-8 encodings as vectors of `u8`.
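Roughly what that looks like in practice (assuming rust-encoding's `Encoding::encode`/`EncoderTrap` interface as I remember it, with the `encoding` crate as a dependency): the bytes come back as a `Vec<u8>` and still have to be reassembled into code units by hand, including committing to an endianness:

```rust
use encoding::{Encoding, EncoderTrap};
use encoding::all::UTF_16LE;

fn main() {
    // rust-encoding hands back raw bytes...
    let bytes: Vec<u8> = UTF_16LE.encode("hé", EncoderTrap::Strict).unwrap();
    // ...so they still need to be re-chunked into u16 code units.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    assert_eq!(units, "hé".encode_utf16().collect::<Vec<u16>>());
}
```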
Also, I'm not sure the standard library helps me here at all. I looked into `Utf16Units`, and it's just an iterator, not a proper string type. (I also know `OsString` doesn't help: I'm not on Windows, and it doesn't even implement `Index<usize>`.)
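As far as I can tell, all std offers is conversion at the boundaries: an iterator of code units out of a `&str` (I believe the stable equivalent of `Utf16Units` on current Rust is `str::encode_utf16`) and validation back into a `String`, with no UTF-16 string type in between:

```rust
fn main() {
    // Out: an iterator of UTF-16 code units from a &str.
    let units: Vec<u16> = "héllo".encode_utf16().collect();
    // In: validation/decoding of code units back into a UTF-8 String.
    let back = String::from_utf16(&units).unwrap();
    assert_eq!(back, "héllo");
    // Nothing in between behaves like an owned/borrowed UTF-16 string type.
}
```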