How to Generate all the characters in the UTF-8 ch

I have been given the task of generating all the characters in the UTF-8 character set to test how a system handles each of them. I do not have much experience with character encoding. The approach I was going to try was to increment a counter and then translate that base-ten number into its equivalent UTF-8 character, but so far I have not been able to find an effective way to do this in C# 3.5.

Any suggestions would be greatly appreciated.

Tags: c# .net utf-8 character-encoding

9 answers

你好瞎i

#2 · 2020-02-09 03:41

As other people have said, UTF-8 is an encoding, not a character set; Unicode is the character set that it encodes.

If you skim through http://www.joelonsoftware.com/articles/Unicode.html it should help clarify what Unicode is.


萌系小妹纸

#3 · 2020-02-09 03:42

There are no "UTF-8 characters". Do you mean Unicode characters or the UTF-8 encoding of Unicode characters?

It's easy to convert an int to a Unicode character, provided of course that there is a mapping for that code and that it fits in a 16-bit char (at most 0xFFFF):

char c = (char)theNumber;

If you want the UTF-8 encoding for that character, that's not very hard either:

byte[] encoded = Encoding.UTF8.GetBytes(c.ToString());

You would have to check the Unicode standard to see the number ranges where there are Unicode characters defined.
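One way to approximate the "defined ranges" check without reading the standard's tables by hand is to ask the runtime for each code point's Unicode category. This is only a sketch: the class and method names (CodePointInfo, IsAssigned) are hypothetical, and the answer depends on the Unicode version whose data ships with your runtime, so newer characters may be reported as unassigned.

```csharp
using System;
using System.Globalization;

static class CodePointInfo
{
    // Hypothetical helper: true if the runtime's Unicode tables list this
    // code point as assigned. Treat the result as an approximation, since
    // it reflects the Unicode version built into the runtime.
    public static bool IsAssigned(int codePoint)
    {
        // Outside the Unicode range: not a code point at all.
        if (codePoint < 0 || codePoint > 0x10FFFF) return false;

        // Surrogates (U+D800..U+DFFF) are not characters on their own, and
        // char.ConvertFromUtf32 would throw for them.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;

        // ConvertFromUtf32 returns one char inside the BMP and a surrogate
        // pair above U+FFFF; the string overload handles both cases.
        string s = char.ConvertFromUtf32(codePoint);
        return CharUnicodeInfo.GetUnicodeCategory(s, 0) != UnicodeCategory.OtherNotAssigned;
    }
}
```

A counter loop could call CodePointInfo.IsAssigned(i) and skip the values for which it returns false before encoding the rest with Encoding.UTF8.GetBytes.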


劫难

#4 · 2020-02-09 03:44

Even once you generate all the characters, you'll find it's not an effective test. Some of the characters are combining marks, which means they combine with the character that comes before them, so a string full of combining marks won't make much sense. There are other special cases too. You'll be much better off using actual text in the languages you need to support.
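To make the combining-mark point concrete, here is a small sketch using StringInfo and string.Normalize from the .NET base class library. The class and method names (CombiningMarkDemo, TextElementCount, Compose) are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

static class CombiningMarkDemo
{
    // Number of user-perceived characters (text elements) in a string,
    // as opposed to the number of 16-bit chars.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Normalizes to form C, which merges a base character and a combining
    // mark into a single precomposed character where one exists.
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }
}
```

For the string "e\u0301" (the letter e followed by U+0301 COMBINING ACUTE ACCENT), Length is 2 but TextElementCount returns 1, and Compose returns the single character "\u00E9". A test string made by concatenating every code point in order would not exercise this kind of interaction in a meaningful way.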


smile是对你的礼貌

#5 · 2020-02-09 03:50

UTF-8 isn't a character set - it's a character encoding which is capable of encoding any character in the Unicode character set into binary data.

Could you give more information about what you're trying to do? You could encode all the possible Unicode characters (including ones which aren't allocated at the moment), although if you need to cope with characters outside the Basic Multilingual Plane (i.e. those above U+FFFF) then it becomes slightly trickier...
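The "slightly trickier" part is that a .NET char holds only 16 bits, so a code point above U+FFFF is stored as two chars (a surrogate pair). A minimal sketch of the round trip, using char.ConvertFromUtf32 and char.ConvertToUtf32; the wrapper class and method names are hypothetical:

```csharp
using System;

static class SupplementaryDemo
{
    // Converts a code point to its UTF-16 string form: one char inside the
    // BMP, two chars (a surrogate pair) above U+FFFF.
    // Throws ArgumentOutOfRangeException for surrogates and values > 0x10FFFF.
    public static string FromCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }

    // Reads the code point that starts at the given index, joining a
    // surrogate pair into a single value if one is present.
    public static int ToCodePoint(string s, int index)
    {
        return char.ConvertToUtf32(s, index);
    }
}
```

For example, FromCodePoint(0x1F600) returns a string whose Length is 2 and for which char.IsSurrogatePair(s, 0) is true, while ToCodePoint on that string at index 0 gives back 0x1F600. A plain (char) cast cannot produce such a value.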


仙女界的扛把子

#6 · 2020-02-09 03:50

UTF-8 is not a charset, it's an encoding. Any value in Unicode can be encoded in UTF-8, using between one and four bytes depending on the code point.

In .NET, characters are 16-bit UTF-16 code units (a single char cannot represent all of Unicode, but this range is the most practical), so you can try this:

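 for (int i = 0; i <= 0xFFFF; i++) {
     // Use an int counter: a char is always < 65536, so a char loop never ends.
     if (i >= 0xD800 && i <= 0xDFFF) continue; // lone surrogates are not characters
     string s = ((char)i).ToString();
     byte[] bytes = Encoding.UTF8.GetBytes(s);
     // do something with bytes
 }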
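To illustrate the point about different byte lengths, the following sketch reports how many bytes UTF-8 needs for a given code point. The class and method names are hypothetical; the calls themselves (Encoding.UTF8.GetByteCount, char.ConvertFromUtf32) are standard .NET.

```csharp
using System;
using System.Text;

static class Utf8LengthDemo
{
    // Returns how many bytes UTF-8 needs for a single code point.
    // Assumes codePoint is a valid non-surrogate value; ConvertFromUtf32
    // throws ArgumentOutOfRangeException otherwise.
    public static int Utf8ByteCount(int codePoint)
    {
        return Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));
    }
}
```

Utf8ByteCount gives 1 for U+0041 (A), 2 for U+00E9 (é), 3 for U+20AC (€) and 4 for U+1F600, which lies outside the 16-bit range covered by the loop.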


smile是对你的礼貌

#7 · 2020-02-09 03:52

You can brute-force an Encoding to figure out which code points it supports. To do so, simply go through all possible code points, convert them to strings, and see if Encoding.GetBytes() throws an exception or not (after setting Encoding.EncoderFallback to EncoderExceptionFallback).

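IEnumerable<int> GetAllWritableCodepoints(Encoding encoding)
{
    // Re-create the encoding with fallbacks that throw instead of silently substituting '?'.
    encoding = Encoding.GetEncoding(encoding.WebName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

    // Docs for char.ConvertFromUtf32() say that 0x10ffff is the maximum code point value.
    for (var i = 0; i <= 0x10ffff; i++)
    {
        // yield return is not allowed inside a try block that has a catch clause,
        // so record the outcome in a flag and yield afterwards.
        var success = false;
        try
        {
            encoding.GetByteCount(char.ConvertFromUtf32(i));
            success = true;
        }
        catch (ArgumentException)
        {
            // Thrown by ConvertFromUtf32 for surrogate code points, and by the
            // encoder (EncoderFallbackException) for unsupported code points.
        }
        if (success)
        {
            yield return i;
        }
    }
}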

This method should support discovering characters represented by surrogate pairs of Char in .NET. However, it is very slow (it takes minutes to run on my machine) and probably impractical.
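For UTF-8 specifically, the exception-based probe is not needed: UTF-8 can encode every Unicode code point except the surrogate range U+D800 to U+DFFF, so the set can be enumerated directly. The sketch below assumes that is all you need; the class and method names are hypothetical, and it does not filter out unassigned code points or noncharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class Utf8CodePoints
{
    // Yields every Unicode scalar value, i.e. every code point from 0 to
    // 0x10FFFF except the surrogates, without relying on exceptions.
    public static IEnumerable<int> AllScalarValues()
    {
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // skip surrogates
            yield return cp;
        }
    }

    // Returns the UTF-8 bytes for one scalar value.
    public static byte[] Encode(int codePoint)
    {
        return Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
    }
}
```

A caller can loop with foreach (int cp in Utf8CodePoints.AllScalarValues()) and pass each value to Encode. The brute-force method remains useful for legacy encodings, where the supported set is not known in advance.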
