Section#1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
int main(int argc, char **argv)
{
    static const unsigned char text[] = "000ßh123456789";
    int32_t current=1;
    /* strlen() expects a const char *, so cast the unsigned char array */
    int32_t text_len = (int32_t)strlen((const char *)text)-1;
    /////////////////////////////////
    printf("Result : %s\n",text);
    /////////////////////////////////
    printf("Length : %d\n",text_len);
    /////////////////////////////////
    printf("Index0 : %c\n",text[0]);
    printf("Index1 : %c\n",text[1]);
    printf("Index2 : %c\n",text[2]);
    printf("Index3 : %c\n",text[3]);//==> why show this `�`?
    printf("Index4 : %c\n",text[4]);//==> why show this `�`?
    printf("Index5 : %c\n",text[5]);
    /////////////////////////////////
    return 0;
}
Why do text[3] and text[4] show �?
How can I also support UTF-8 characters when indexing?
Section#2
I want to write a function like mb_substr in PHP:
(verybigstring or string) mb_substr ( (verybigstring or string) input , (verybigint or int) start [, (verybigint or int) $length = NULL ] )
Some examples:
mb_substr("hello world",0); ==> hello world
mb_substr("hello world",1); ==> ello world
mb_substr_two("hello world",1,3); ==> el
mb_substr("hello world",-3); ==> rld
mb_substr_two("hello world",-3,2); ==> rldhe
My question is about Section#1.
Can anyone help me, please?
The Unicode character set currently includes more than 128,000 characters (which I shall henceforth call Code Points to avoid confusion), with space reserved for far, far more. As such, a char, which is only 8 bits in size on modern general-purpose machines, can't be used to contain a Code Point.

UTF-8 is a way of encoding these Code Points into bytes. The following are the bytes you placed in text[] (assuming UTF-8 was used to encode the Code Points) and what they represent:

text[0..2]    30 (three times)   U+0030  0
text[3..4]    C3 9F              U+00DF  ß
text[5]       68                 U+0068  h
text[6..14]   31 through 39      U+0031 through U+0039  1 through 9
text[15]      00                 terminating NUL

As you can see, UTF-8 is a variable-width encoding: a single Code Point encodes to a variable number of bytes. Here ß takes two bytes, which is why text[3] and text[4] each show � when printed on their own: neither byte is a complete character. This means you can't translate indexes-into-text into indexes-into-array-of-bytes without scanning the array.
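To check the byte layout yourself, a small helper can render a string's bytes as hex. This is only a sketch: hex_bytes is a hypothetical name, not a standard function, and the test string is written with \x escapes so the result does not depend on how the source file is saved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
   uppercase hex. Returns the number of bytes described; it stops early if
   out (of capacity cap) cannot hold the next byte. */
size_t hex_bytes(const char *s, char *out, size_t cap)
{
    assert(s != NULL && out != NULL && cap > 0);
    size_t used = 0, n = 0;
    out[0] = '\0';
    for (; s[n] != '\0'; n++) {
        /* Worst case per byte: separator + two hex digits + NUL. */
        if (used + 4 > cap)
            break;
        used += (size_t)snprintf(out + used, cap - used, "%s%02X",
                                 n ? " " : "",
                                 (unsigned)(unsigned char)s[n]);
    }
    return n;
}
```

Calling hex_bytes("0\xC3\x9F" "h", buf, sizeof buf) with a 64-byte buf fills it with "30 C3 9F 68". The literal is split in two so the \x9F escape does not swallow the characters that follow it.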
A Code Point encoded using UTF-8 starts with a byte of the form 0b0xxxxxxx or 0b11xxxxxx.
The only other form of bytes you will encounter in UTF-8 is 0b10xxxxxx, which marks a continuation byte.
A simple way to find the nth Code Point in a string (if you assume the input is valid UTF-8) is to search for the nth char for which (ch & 0xC0) != 0x80 is true, that is, the nth byte that is not a continuation byte. You can use the same approach to count the number of Code Points in a string.
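A minimal sketch of that scan, assuming valid, NUL-terminated UTF-8 input. The names utf8_is_lead, utf8_count and utf8_at are made up for illustration, and nothing here validates the encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b starts a Code Point, i.e. it is not a continuation
   byte of the form 10xxxxxx. */
int utf8_is_lead(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Counts the Code Points in s by counting the non-continuation bytes. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (utf8_is_lead((unsigned char)*s))
            count++;
    return count;
}

/* Returns a pointer to the first byte of the n-th Code Point (0-based),
   or to the terminating NUL if s has n or fewer Code Points. */
const char *utf8_at(const char *s, size_t n)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (utf8_is_lead((unsigned char)*s)) {
            if (n == 0)
                return s;
            n--;
        }
    }
    return s;
}
```

The byte length of Code Point k is utf8_at(s, k + 1) - utf8_at(s, k), so it can be printed whole with printf("%.*s", (int)len, utf8_at(s, k)) instead of %c. A substring function in the spirit of mb_substr could be built from two utf8_at calls, one for the start index and one for the end.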