I am testing migration from Delphi 5 to XE. Being unfamiliar with UnicodeString, before asking my question I would like to present its background.
The Delphi XE string functions Copy, Delete and Insert take an Index parameter telling where the operation should start. Index may be any integer from 1 up to the length of the string to which the function is applied. Since a string can contain multi-element characters, an operation can start at an element (e.g. a surrogate) that belongs to a multi-element sequence encoding a single Unicode named code point. So, starting from a sensible string and using one of these functions, we can obtain a nonsensical result.
The phenomenon can be illustrated with the cases below, which apply the function Copy to strings representing the same array of named code points (i.e. meaningful signs):
($61, $13000, $63)
It is the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'.
Case 1. Copy of AnsiString (element = byte)
We start with the above-mentioned UnicodeString #$61#$13000#$63 and convert it to a UTF-8 encoded AnsiString s0.
Then we test the function copy (s0, index, 1) for all possible values of index; there are 6 of them since s0 is 6 bytes long.
procedure Copy_Utf8Test;
type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
var ss : string;
s0,s1 : TAnsiStringUtf8;
ii : integer;
begin
ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
s0 := ss; //mem dump of s0: $61 $F0 $93 $80 $80 $63
ii := length(s0); //sets ii=6 (bytes)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$F0 F means "start of 4-byte series"; no corresponding named code-point
s1 := copy(s0,3,1); //#$93 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,4,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,5,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,6,1); //'c'
end;
The first and last results are sensible within UTF-8 codepage, while the other 4 are not.
Case 2. Copy of UnicodeString (element = word)
We start with the same UnicodeString s0 := #$61#$13000#$63.
Then we test the function copy (s0, index, 1) for all possible values of index; there are 4 of them since s0 is 4 words long.
procedure Copy_Utf16Test;
var s0,s1 : string;
ii : integer;
begin
s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
ii := length(s0); //sets ii=4 (words)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$D80C surrogate pair member; no corresponding named code-point
s1 := copy(s0,3,1); //#$DC00 surrogate pair member; no corresponding named code-point
s1 := copy(s0,4,1); //'c'
end;
The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.
Conclusion.
The string functions Copy, Delete and Insert work perfectly on a string considered as a mere array of bytes or words. But they are not helpful if the string is seen as what it essentially is, i.e. a representation of an array of named code points.
Both cases above deal with strings that represent the same array of 3 named code points. They are representations (encodings) of the same text composed of 3 meaningful signs (to avoid abusing the term "characters").
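To make the "3 named code points" claim concrete, here is a minimal sketch that counts code points instead of elements in both encodings. CodePointCountUtf16 and CodePointCountUtf8 are hypothetical helper names of my own, not RTL routines, and the sketch assumes well-formed input (it does not validate the encoding).

```delphi
program CodePointCountDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points in a UTF-16 string by counting
// every element that is NOT a trailing (low) surrogate, $DC00..$DFFF.
function CodePointCountUtf16(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
function CodePointCountUtf8(const S: UTF8String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  U16: UnicodeString;
  U8: UTF8String;
begin
  // 'a', U+13000 written as its surrogate pair, 'c'
  U16 := #$61#$D80C#$DC00#$63;
  U8 := UTF8Encode(U16);
  // Element count versus code point count for each encoding
  WriteLn('UTF-16: ', Length(U16), ' elements, ', CodePointCountUtf16(U16), ' code points');
  WriteLn('UTF-8:  ', Length(U8), ' elements, ', CodePointCountUtf8(U8), ' code points');
end.
```

Both counts should agree on 3 code points, although Length reports 4 words in one case and 6 bytes in the other.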
One may want to extract (copy) any of those meaningful signs regardless of whether a particular text representation (encoding) is single- or multi-element. I have spent quite some time looking for a satisfactory equivalent of the Copy that I was used to in Delphi 5.
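To illustrate the kind of function I have in mind, here is a rough sketch for the UTF-16 case only. CopyCodePoints and IsTrailSurrogate are hypothetical names I made up, not RTL routines; Index and Count are measured in code points rather than elements, and the input is assumed to be well-formed UTF-16.

```delphi
program CopyCodePointsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: True for the second half of a surrogate pair.
function IsTrailSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Hypothetical code-point-aware Copy: returns Count code points starting
// at the Index-th code point (1-based). Assumes well-formed UTF-16.
function CopyCodePoints(const S: UnicodeString; Index, Count: Integer): UnicodeString;
var
  I, CP, StartPos: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;
  CP := 0;       // code points seen so far
  StartPos := 0; // element index where the requested range begins
  for I := 1 to Length(S) do
    if not IsTrailSurrogate(S[I]) then
    begin
      Inc(CP); // S[I] starts a new code point
      if CP = Index then
        StartPos := I;
      if CP - Index = Count then
      begin
        // S[I] starts the first code point past the requested range
        Result := Copy(S, StartPos, I - StartPos);
        Exit;
      end;
    end;
  // The requested range runs to the end of the string
  if StartPos > 0 then
    Result := Copy(S, StartPos, Length(S) - StartPos + 1);
end;

var
  S: UnicodeString;
begin
  S := #$61#$D80C#$DC00#$63; // 'a', U+13000 as a surrogate pair, 'c'
  // Lengths are printed instead of the text to avoid console encoding issues
  WriteLn(Length(CopyCodePoints(S, 1, 1))); // 'a': one element
  WriteLn(Length(CopyCodePoints(S, 2, 1))); // the hieroglyph: both surrogates
  WriteLn(Length(CopyCodePoints(S, 3, 1))); // 'c': one element
end.
```

Every call scans the string from the beginning, so this is linear in the string length; that seems to be the price of indexing by code point in a variable-width encoding.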
Question. Do such equivalents exist, or do I have to write them myself?
What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (i.e. code units in Unicode terminology), and always have.
AnsiString is a string of 8-bit AnsiChar elements, which can be encoded in any 8-bit ANSI/MBCS format, including UTF-8.
UnicodeString (and WideString) is a string of 16-bit WideChar elements, which are encoded in UTF-16.
The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.
If you need encoding-aware Copy/Delete/Insert functions that operate on logical code point boundaries, where each code point may be 1+ elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware string-manipulation functions in the RTL.
You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string: