Section#1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
int main(int argc, char **argv)
{
static const unsigned char text[] = "000ßh123456789";
int32_t current=1;
int32_t text_len = (int32_t)strlen((const char *)text)-1; /* strlen counts bytes, not characters */
/////////////////////////////////
printf("Result : %s\n",text);
/////////////////////////////////
printf("Length : %d\n",text_len);
/////////////////////////////////
printf("Index0 : %c\n",text[0]);
printf("Index1 : %c\n",text[1]);
printf("Index2 : %c\n",text[2]);
printf("Index3 : %c\n",text[3]);//==> why show this `�`?
printf("Index4 : %c\n",text[4]);//==> why show this `�`?
printf("Index5 : %c\n",text[5]);
/////////////////////////////////
return 0;
}
Why do text[3] and text[4] show �?
How can I also support UTF-8 characters when indexing?
Section#2
I want to write a function like mb_substr in PHP:
(verybigstring or string) mb_substr ( (verybigstring or string) input , (verybigint or int) start [, (verybigint or int) $length = NULL ] )
Some examples:
mb_substr("hello world",0);
==>hello world
mb_substr("hello world",1);
==>ello world
mb_substr_two("hello world",1,3);
==>el
mb_substr("hello world",-3);
==>rld
mb_substr_two("hello world",-3,2);
==>rldhe
My question is about Section#1.
Can anyone help me, please?
The Unicode character set currently includes more than 128,000 characters (which I shall henceforth call Code Points to avoid confusion), with space reserved for far, far more. As such, a char, which is only 8 bits in size on modern general-purpose machines, can't hold an arbitrary Code Point.
UTF-8 is a way of encoding these Code Points into bytes. The following are the bytes you placed in text[] (assuming UTF-8 was used to encode the Code Points) and what they represent:
i:          0    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
text[i]:    0x30 30 30 C3 9F 68 31 32 33 34 35 36 37 38 39 00
            --   -- -- ----- -- -- -- -- -- -- -- -- -- -- --
Code Point: U+30 30 30 DF    68 31 32 33 34 35 36 37 38 39 0
Graph:      0    0  0  ß     h  1  2  3  4  5  6  7  8  9
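To check a table like the one above against your own string, you can dump its bytes in hex. The following is only a minimal sketch: hex_dump is a hypothetical helper name (not a standard function), and it assumes the string is NUL-terminated and that the caller supplies a large enough output buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
 * two-digit hex values. Returns the number of bytes of s that were written;
 * stops early if out is too small. */
static size_t hex_dump(const char *s, char *out, size_t out_size)
{
    size_t n = 0;     /* bytes of s consumed so far */
    size_t used = 0;  /* characters written to out so far */

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; n++) {
        /* Worst case per byte: separator + 2 hex digits + terminator. */
        if (used + 4 > out_size)
            break;
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 n ? " " : "",
                                 (unsigned)(unsigned char)s[n]);
    }
    return n;
}
```

Called on the bytes of "0ßh" (with ß encoded as UTF-8), this fills the buffer with 30 C3 9F 68, showing that ß occupies two array elements.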
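As you can see, UTF-8 is a variable-width encoding. A single Code Point encodes to a variable number of bytes. Here, text[3] and text[4] are the two halves of ß; printing either byte on its own with %c sends an incomplete sequence to the terminal, which displays it as �. This also means you can't translate indexes-into-text into indexes-into-array-of-bytes without scanning the array.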
A Code Point encoded using UTF-8 starts with one of the following bytes:
0b0xxxxxxx  Represents an entire Code Point
0b110xxxxx  The start of a 2-byte sequence
0b1110xxxx  The start of a 3-byte sequence
0b11110xxx  The start of a 4-byte sequence
The only other form of byte you will encounter in UTF-8 is
0b10xxxxxx  A continuation byte (the 2nd, 3rd or 4th byte of a sequence)
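The bit patterns above translate directly into mask-and-compare tests. As a rough sketch (utf8_seq_len is a hypothetical name, and it only inspects the first byte; it does not validate the bytes that follow):

```c
#include <stddef.h>

/* Hypothetical helper: returns the length in bytes of the sequence that
 * starts with byte b, or 0 if b is a continuation byte or not a valid
 * UTF-8 lead byte. */
static size_t utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0b0xxxxxxx: whole Code Point */
    if ((b & 0xE0) == 0xC0) return 2;  /* 0b110xxxxx: 2-byte sequence  */
    if ((b & 0xF0) == 0xE0) return 3;  /* 0b1110xxxx: 3-byte sequence  */
    if ((b & 0xF8) == 0xF0) return 4;  /* 0b11110xxx: 4-byte sequence  */
    return 0;                          /* 0b10xxxxxx or invalid        */
}
```

For the string in the question, utf8_seq_len(0xC3) gives 2 (the start of ß) and utf8_seq_len(0x9F) gives 0 (its continuation byte).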
A simple way to find the nth Code Point in a string (if you assume the input is valid UTF-8) is to search for the nth char for which (ch & 0xC0) != 0x80 is true, i.e. the nth byte that is not a continuation byte. You can use the same approach to count the number of Code Points in a string.
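Here is one way the scan could look in C. Treat it as a sketch under the same assumption (valid, NUL-terminated UTF-8). The function names utf8_count, utf8_at and utf8_cp_bytes are made up for this example. A byte is a continuation byte when (b & 0xC0) == 0x80, so every other byte starts a Code Point.

```c
#include <stddef.h>

/* True for 0b10xxxxxx, i.e. the 2nd, 3rd or 4th byte of a sequence. */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Counts Code Points by counting the bytes that are not continuations. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (!is_cont((unsigned char)*s))
            n++;
    return n;
}

/* Returns a pointer to the first byte of the n-th Code Point (0-based),
 * or NULL if the string holds n or fewer Code Points. */
static const char *utf8_at(const char *s, size_t n)
{
    for (; *s != '\0'; s++) {
        if (!is_cont((unsigned char)*s)) {
            if (n == 0)
                return s;
            n--;
        }
    }
    return NULL;
}

/* Returns how many bytes the Code Point starting at p occupies
 * (0 if p points at the terminating NUL). */
static size_t utf8_cp_bytes(const char *p)
{
    size_t len = 1;
    if (*p == '\0')
        return 0;
    while (is_cont((unsigned char)p[len]))
        len++;
    return len;
}
```

To print the character at a given text index, pass the pointer and byte count together instead of a single char, for example printf("%.*s\n", (int)utf8_cp_bytes(p), p) with p = utf8_at(text, 3). On a UTF-8 terminal this emits both bytes of ß at once, so it is displayed correctly. A function like mb_substr could be built on the same two calls: find the start pointer, find the end pointer, and copy the bytes in between.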