Section#1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    static const unsigned char text[] = "000ßh123456789";
    int32_t current=1;
    int32_t text_len = (int32_t)strlen((const char *)text); /* counts bytes, not characters */
    /////////////////////////////////
    printf("Result : %s\n",text);
    /////////////////////////////////
    printf("Length : %d\n",text_len);
    /////////////////////////////////
    printf("Index0 : %c\n",text[0]);
    printf("Index1 : %c\n",text[1]);
    printf("Index2 : %c\n",text[2]);
    printf("Index3 : %c\n",text[3]);//==> why does this show `�`?
    printf("Index4 : %c\n",text[4]);//==> why does this show `�`?
    printf("Index5 : %c\n",text[5]);
    /////////////////////////////////
    return 0;
}

Why do text[3] and text[4] show �?

How can I also support UTF-8 characters when indexing?

Section#2

I want to write a function like mb_substr in PHP.

(verybigstring or string) mb_substr ( (verybigstring or string) input , (verybigint or int) start [, (verybigint or int) $length = NULL ] )

Some examples:

mb_substr("hello world",0);

==>hello world
mb_substr("hello world",1);

==>ello world
mb_substr_two("hello world",1,3);

==>el
mb_substr("hello world",-3);

==>rld
mb_substr_two("hello world",-3,2);

==>rldhe

My question is about Section#1.

Can anyone help me, please?

Tags: c string unicode utf-8 substring

1 Answer

劳资没心，怎么记你

#2 · 2019-03-06 23:44

The Unicode character set currently includes more than 128,000 characters (which I shall henceforth call Code Points to avoid confusion), with space reserved for far, far more. As such, a char, which is only 8 bits in size on modern general-purpose machines, can't hold an arbitrary Code Point.

UTF-8 is a way of encoding these Code Points into bytes. The following are the bytes you placed in text[] (assuming UTF-8 was used to encode the Code Points) and what they represent:

i:             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
text[i]:    0x30 30 30 C3 9F 68 31 32 33 34 35 36 37 38 39 00
              -- -- -- ----- -- -- -- -- -- -- -- -- -- -- --
Code Point: U+30 30 30    DF 68 31 32 33 34 35 36 37 38 39  0
Graph:         0  0  0     ß  h  1  2  3  4  5  6  7  8  9
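
To check the table above on your own machine, you can dump the raw bytes of a string. The following is only a minimal sketch: dump_bytes is a hypothetical helper name, not a standard function, and it assumes the string is stored as UTF-8.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
   hex pairs (cap is the capacity of out, including the terminating NUL).
   Returns the number of bytes that were dumped. */
size_t dump_bytes(const char *s, char *out, size_t cap)
{
    size_t n = strlen(s);   /* strlen counts bytes, not Code Points */
    size_t used = 0;
    size_t i;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    /* Each byte needs at most 3 characters (" XX") plus the NUL. */
    for (i = 0; i < n && used + 3 < cap; i++) {
        used += (size_t)snprintf(out + used, cap - used, "%s%02X",
                                 i ? " " : "",
                                 (unsigned)(unsigned char)s[i]);
    }
    return i;
}
```

Calling it on the first six bytes of text[] should produce 30 30 30 C3 9F 68, which shows that ß occupies two array slots.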

As you can see, UTF-8 is a variable-width encoding: a single Code Point encodes to a variable number of bytes (one to four). This means you can't translate indexes-into-text into indexes-into-array-of-bytes without scanning the array.

A Code Point encoded using UTF-8 starts with

0b0xxxxxxx    Represents an entire Code Point
0b110xxxxx    The start of a 2-byte sequence
0b1110xxxx    The start of a 3-byte sequence
0b11110xxx    The start of a 4-byte sequence

The only other form of bytes you will encounter in UTF-8 is

0b10xxxxxx    A continuation byte (the 2nd, 3rd or 4th byte of sequence)
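
The bit patterns above translate directly into mask tests. As a sketch (utf8_seq_len is a name made up for illustration), the length of a sequence can be read off its first byte like this:

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the sequence that starts with
   `lead`, or 0 if `lead` is a continuation byte or not a valid lead byte. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0b0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 0b110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 0b1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 0b11110xxx */
    return 0;                            /* 0b10xxxxxx or 0b11111xxx */
}
```

For the bytes in text[], 0x30 gives 1, 0xC3 gives 2, and 0x9F gives 0 because it is a continuation byte.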

A simple way to find the n^th Code Point in a string (if you assume the input is valid UTF-8) is to search for the n^th char for which (ch & 0xC0) != 0x80 is true, that is, the n^th byte that is not a continuation byte. You can use the same approach to count the number of Code Points in a string.
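
A minimal sketch of that scan follows. It treats every byte that is not a continuation byte, i.e. every byte for which (ch & 0xC0) != 0x80, as the start of a Code Point. The names utf8_count and utf8_at are made up for illustration, and both functions assume the input is valid, NUL-terminated UTF-8; they do no validation.

```c
#include <stddef.h>

/* Count Code Points: every byte that is not a continuation byte
   (0b10xxxxxx) starts a new one. Assumes valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Return a pointer to the first byte of the n-th Code Point (0-based),
   or to the terminating NUL if the string has fewer Code Points. */
const char *utf8_at(const char *s, size_t n)
{
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (n == 0)
                return s;
            n--;
        }
    }
    return s;
}
```

For the string in the question, utf8_count returns 14 while strlen returns 15. To print Code Point 3 (the ß), take start = utf8_at(text, 3) and end = utf8_at(text, 4), then print the bytes in between, for example with printf("%.*s", (int)(end - start), start).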


How SubString,Limit Using C? [closed]
