How to manipulate substrings, and not subarrays, of U

I am testing a migration from Delphi 5 to Delphi XE. Since I am unfamiliar with UnicodeString, I would like to present some background before asking my question.

The Delphi XE string functions Copy, Delete and Insert take an Index parameter that tells where the operation should start. Index may be any integer from 1 to the length of the string the function is applied to. Since a string can contain multi-element characters, the operation can start at an element (for example a surrogate) in the middle of a multi-element sequence that encodes a single named Unicode code point. So, starting from a sensible string and using one of these functions, we can obtain a nonsensical result.

The phenomenon can be illustrated with the cases below, which apply the function Copy to strings representing the same array of named code points (i.e. meaningful signs):

  ($61, $13000, $63)

It is the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'.

Case 1. Copy of AnsiString (element = byte)

We start with the above-mentioned UnicodeString #$61#$13000#$63 and convert it to a UTF-8 encoded AnsiString s0.

Then we test the function

  copy (s0, index, 1)

for all possible values of index; there are 6 of them since s0 is 6 bytes long.

    procedure Copy_Utf8Test;
    type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
    var ss    : string;
        s0,s1 : TAnsiStringUtf8;
        ii    : integer;
    begin
      ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
      s0 := ss;              //mem dump of s0: $61 $F0 $93 $80 $80 $63
      ii := length(s0);      //sets ii=6 (bytes)
      s1 := copy(s0,1,1);    //'a'
      s1 := copy(s0,2,1);    //#$F0  F means "start of 4-byte series"; no corresponding named code-point
      s1 := copy(s0,3,1);    //#$93  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,4,1);    //#$80  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,5,1);    //#$80  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,6,1);    //'c'
    end;

The first and last results are sensible within UTF-8 codepage, while the other 4 are not.

Case 2. Copy of UnicodeString (element = word)

We start with the same UnicodeString s0 := #$61#$13000#$63.

Then we test the function

  copy (s0, index, 1)

for all possible values of index; there are 4 of them since s0 is 4 words long.

    procedure Copy_Utf16Test;
    var s0,s1 : string;
        ii    : integer;
    begin
      s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
      ii := length(s0);      //sets ii=4 (words)
      s1 := copy(s0,1,1);    //'a'
      s1 := copy(s0,2,1);    //#$D80C surrogate pair member; no corresponding named code-point
      s1 := copy(s0,3,1);    //#$DC00 surrogate pair member; no corresponding named code-point
      s1 := copy(s0,4,1);    //'c'
    end;

The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.

Conclusion.

The string functions Copy, Delete and Insert work perfectly well on a string considered as a mere array of bytes or words. But they are not helpful if the string is seen as what it essentially is, i.e. a representation of an array of named code points.

Both cases above deal with strings that represent the same array of 3 named code points. They are considered representations (encodings) of the same text composed of 3 meaningful signs (to avoid abusing the term "characters").

One may want to be able to extract (copy) any of those meaningful signs regardless of whether a particular text representation (encoding) is a mono- or multi-element one. I have spent quite some time looking for a satisfactory equivalent of the Copy I was used to in Delphi 5.

Question. Do such equivalents exist, or do I have to write them myself?

Answer 1:

What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (i.e. code units in Unicode terminology), and always have.

AnsiString is a string of 8-bit AnsiChar elements, which can be encoded in any 8-bit ANSI/MBCS format, including UTF-8.

UnicodeString (and WideString) is a string of 16-bit WideChar elements, which are encoded in UTF-16.

The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.

If you need encoding-aware Copy/Delete/Insert functions that operate on logical code point boundaries, where each code point may occupy one or more elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware manipulation functions of this kind in the RTL.
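As a rough illustration of the "write your own functions" route, here is a minimal sketch of a Copy variant for a UTF-16 UnicodeString whose Index and Count are counted in code points rather than WideChar elements. CodePointLenAt and CopyCodePoints are hypothetical names, not RTL routines. The sketch only keeps surrogate pairs together, treats an unpaired surrogate as a single unit, and makes no attempt to handle combining sequences.

```pascal
program CopyCodePointsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Number of UTF-16 code units (1 or 2) used by the code point starting at S[I].
// A high surrogate that is not followed by a low surrogate counts as 1.
function CodePointLenAt(const S: string; I: Integer): Integer;
begin
  Result := 1;
  if (I < Length(S)) and
     (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
     (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    Result := 2;
end;

// Like Copy, but Index and Count are measured in code points (hypothetical helper).
function CopyCodePoints(const S: string; Index, Count: Integer): string;
var
  I, First, N: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;
  // Skip the first Index-1 code points to find the starting element.
  I := 1;
  N := 1;
  while (I <= Length(S)) and (N < Index) do
  begin
    Inc(I, CodePointLenAt(S, I));
    Inc(N);
  end;
  First := I;
  // Advance over up to Count code points.
  N := 0;
  while (I <= Length(S)) and (N < Count) do
  begin
    Inc(I, CodePointLenAt(S, I));
    Inc(N);
  end;
  // Element-based Copy is safe now that both ends sit on code point boundaries.
  Result := Copy(S, First, I - First);
end;

var
  S: string;
begin
  S := #$61#$13000#$63;
  Writeln(Length(CopyCodePoints(S, 1, 1)));  // 1 element:  'a'
  Writeln(Length(CopyCodePoints(S, 2, 1)));  // 2 elements: the whole surrogate pair
  Writeln(Length(CopyCodePoints(S, 3, 1)));  // 1 element:  'c'
end.
```

Delete and Insert variants could be built the same way: translate the code point index into an element index first, then call the ordinary RTL routine with that element index.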

Answer 2:

You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string:

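    program Project9;

    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    // Returns the byte length of the first code point in a UTF-8 string,
    // judged from its lead byte: 0 for an empty string, -1 for an invalid lead byte.
    function GetFirstCodepointSize(const S: UTF8String): Integer;
    var
      B: Byte;
    begin
      if S = '' then
      begin
        Result:= 0;  // nothing to inspect; avoids reading S[1] of an empty string
        Exit;
      end;
      B:= Byte(S[1]);
      if (B and $80 = 0) then
        Result:= 1   // 0xxxxxxx: single byte (ASCII)
      else if (B and $E0 = $C0) then
        Result:= 2   // 110xxxxx: lead byte of a 2-byte sequence
      else if (B and $F0 = $E0) then
        Result:= 3   // 1110xxxx: lead byte of a 3-byte sequence
      else if (B and $F8 = $F0) then
        Result:= 4   // 11110xxx: lead byte of a 4-byte sequence
      else
        Result:= -1; // invalid code (continuation byte or illegal lead byte)
    end;

    var
      S: string;

    begin
      // S is a UTF-16 UnicodeString; the explicit cast converts it to UTF-8.
      S:= #$61#$13000#$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 1
      S:= #$13000#$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 4
      S:= #$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 1
      Readln;
    end.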
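
The lead-byte test used in GetFirstCodepointSize can be applied at any position, not only at S[1], which is enough to build a code-point-indexed copy for UTF-8. The following is a sketch under that assumption. Utf8SeqLenAt and Utf8CopyCodePoints are hypothetical helper names, and malformed input is simply stepped over one byte at a time rather than validated.

```pascal
program Utf8CopyDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Byte length of the UTF-8 sequence starting at S[I], decided from the lead byte.
// Stray continuation bytes, invalid lead bytes and truncated sequences count as 1.
function Utf8SeqLenAt(const S: UTF8String; I: Integer): Integer;
var
  B: Byte;
begin
  B := Byte(S[I]);
  if B and $80 = 0 then
    Result := 1
  else if B and $E0 = $C0 then
    Result := 2
  else if B and $F0 = $E0 then
    Result := 3
  else if B and $F8 = $F0 then
    Result := 4
  else
    Result := 1;
  if I + Result - 1 > Length(S) then
    Result := 1;  // sequence would run past the end of the string
end;

// Like Copy, but Index and Count are measured in code points (hypothetical helper).
function Utf8CopyCodePoints(const S: UTF8String; Index, Count: Integer): UTF8String;
var
  I, First, N: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;
  // Skip the first Index-1 code points to find the starting byte.
  I := 1;
  N := 1;
  while (I <= Length(S)) and (N < Index) do
  begin
    Inc(I, Utf8SeqLenAt(S, I));
    Inc(N);
  end;
  First := I;
  // Advance over up to Count code points.
  N := 0;
  while (I <= Length(S)) and (N < Count) do
  begin
    Inc(I, Utf8SeqLenAt(S, I));
    Inc(N);
  end;
  Result := Copy(S, First, I - First);
end;

var
  U: string;
  S: UTF8String;
begin
  U := #$61#$13000#$63;
  S := UTF8String(U);  // explicit UTF-16 to UTF-8 conversion, 6 bytes
  Writeln(Length(Utf8CopyCodePoints(S, 1, 1)));  // 1 byte:  'a'
  Writeln(Length(Utf8CopyCodePoints(S, 2, 1)));  // 4 bytes: the hieroglyph
  Writeln(Length(Utf8CopyCodePoints(S, 2, 2)));  // 5 bytes: hieroglyph plus 'c'
end.
```

Because each step costs a scan from the start of the string, code that extracts many substrings would probably do better to walk the string once and remember the byte offsets.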