I'm working on porting some Delphi 7 code to XE4, so Unicode is the subject here.
I have a method where a string gets written to a TMemoryStream. According to this Embarcadero article, I should multiply the length of the string (in characters) by the size of the Char type to get the length in bytes that must be passed as the count parameter to WriteBuffer.
so before:
rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));
after:
rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml)* SizeOf(Char));
My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented in 2 bytes; some corner-case characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicion: "In fact, it isn't even always true that one Char is equal to two bytes!"
So... that leaves me wondering whether Length(rawHtml) * SizeOf(Char) is really going to be robust enough to be consistently accurate, or whether there's a better way to determine the size of the string in bytes.
You are correct about the UTF-16 encoding of Delphi's UnicodeString. This means that one 16-bit character is wide enough to represent every code point from the Basic Multilingual Plane as exactly one Char element of the string array.
However, you've got a little misconception here. The Length function does not perform any deep inspection of characters; it simply returns the number of 16-bit WideChar elements, without taking into account any surrogates within your string. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.
Conclusion: the byte size of the string data is always fixed and equals Length(S) * SizeOf(Char), no matter whether S contains any variable-length characters.
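A minimal sketch of that point, assuming a small console test program (the surrogate escapes #$D83D#$DE00 encode U+1F600, a code point outside the BMP):

program SurrogateLengthDemo;

{$APPTYPE CONSOLE}

var
  S: string;
begin
  // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
  // as a surrogate pair: two WideChar elements.
  S := #$D83D#$DE00;

  Writeln('Length(S)              = ', Length(S));                  // 2 elements
  Writeln('Length(S)*SizeOf(Char) = ', Length(S) * SizeOf(Char));   // 4 bytes
end.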
What you are doing is correct (with the SizeOf(Char)).
What you are referring to is that one character does not always correspond to one code point (due to surrogate pairs, for example). But whether you regard the string as UCS-2 or as UTF-16, it is stored as fixed-size 2-byte elements, so the data occupies exactly Length(Str) * SizeOf(Char) bytes.
Note that the Unicode encoding used in Delphi is the same one that all Windows API calls expect in their ...W variants.
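To see that this byte count round-trips even when the string contains a surrogate pair, here is a sketch along the lines of the question's code (the stream and variable names are illustrative):

program StreamRoundTripDemo;

{$APPTYPE CONSOLE}

uses
  Classes;

var
  Original, ReadBack: string;
  Stream: TMemoryStream;
  ByteCount: Integer;
begin
  // Mix BMP text with a supplementary-plane character (a surrogate pair).
  Original := 'html ' + #$D83D#$DE00;
  ByteCount := Length(Original) * SizeOf(Char);

  Stream := TMemoryStream.Create;
  try
    // Write the raw UTF-16 payload: element count times element size.
    Stream.WriteBuffer(Pointer(Original)^, ByteCount);

    // Read it back into a string sized in Char elements, not bytes.
    Stream.Position := 0;
    SetLength(ReadBack, ByteCount div SizeOf(Char));
    Stream.ReadBuffer(Pointer(ReadBack)^, ByteCount);

    Writeln('Round trip intact: ', Original = ReadBack); // TRUE
  finally
    Stream.Free;
  end;
end.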
Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable-length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed-length Unicode encoding is UTF-32. The UTF-16 encoding uses 16-bit character elements, hence the name.
In a Unicode Delphi, Char is an alias for WideChar, which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in that array.
So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements and not the number of code points. The character elements all have the same size. So Length(rawHtml) * SizeOf(Char) is correct.
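If you ever do need the number of code points rather than the number of Char elements, you have to walk the string and skip the second half of every surrogate pair. A rough sketch, with a made-up helper name:

program CodePointCountDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts Unicode code points rather than WideChar elements.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A high surrogate (U+D800..U+DBFF) starts a two-element pair; skip both halves.
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: string;
begin
  S := 'a' + #$D83D#$DE00;                                          // 'a' plus U+1F600
  Writeln('Length (Char elements) = ', Length(S));                  // 3
  Writeln('Code points            = ', CodePointCount(S));          // 2
  Writeln('Bytes for WriteBuffer  = ', Length(S) * SizeOf(Char));   // 6
end.

The byte count to pass to WriteBuffer is still Length(S) * SizeOf(Char); the code-point count only matters if you need to reason about "characters" in the Unicode sense.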
Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function: SysUtils.ByteLength().
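A quick sketch of using it in place of the manual multiplication (ByteLength is declared in SysUtils and, for a UnicodeString, returns the same value as Length(S) * SizeOf(Char)):

program ByteLengthDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  rawHtml: string;
  memorystream1: TMemoryStream;
begin
  rawHtml := '<p>hello ' + #$D83D#$DE00 + '</p>';

  Writeln('Length(rawHtml) * SizeOf(Char) = ', Length(rawHtml) * SizeOf(Char));
  Writeln('ByteLength(rawHtml)            = ', ByteLength(rawHtml)); // same value

  memorystream1 := TMemoryStream.Create;
  try
    // Same effect as the question's call, but the intent is clearer.
    memorystream1.WriteBuffer(Pointer(rawHtml)^, ByteLength(rawHtml));
    Writeln('Stream size                    = ', memorystream1.Size);
  finally
    memorystream1.Free;
  end;
end.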