How to Generate all the characters in the UTF-8 ch

Posted 2020-02-09 03:32

I have been given the task of generating all the characters in the UTF-8 character set to test how a system handles each of them. I do not have much experience with character encoding. The approach I was going to try was to increment a counter and then translate that base-ten number into its equivalent UTF-8 character, but so far I have not been able to find an effective way to do this in C# 3.5.

Any suggestions would be greatly appreciated.

9 answers
老娘就宠你
#2 · 2020-02-09 03:55
System.Net.WebClient client = new System.Net.WebClient();
string definedCodePoints = client.DownloadString(
                         "http://unicode.org/Public/UNIDATA/UnicodeData.txt");
System.IO.StringReader reader = new System.IO.StringReader(definedCodePoints);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
while(true) {
  string line = reader.ReadLine();
  if(line == null) break;
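  // Field 0 is the code point in hex. Large ranges (e.g. CJK ideographs) appear only as
  // "<..., First>" / "<..., Last>" line pairs, so this loop sees just their endpoints.
  int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
  if(codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    // surrogate range; not valid code points on their own, but listed in the document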
  } else {
    string utf16 = char.ConvertFromUtf32(codePoint);
    byte[] utf8 = encoder.GetBytes(utf16);
    //TODO: something with the UTF-8-encoded character
  }
}

The above code should iterate over the currently assigned Unicode characters. You'll probably want to parse the UnicodeData file locally and fix any C# blunders I've made.
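
Parsing a local copy might look roughly like the sketch below. It is not the answerer's code: the class name `UnicodeDataParser`, the helper `ReadAssigned`, and the convention that the file path arrives as the first command-line argument are all assumptions made for illustration. It relies on the UnicodeData.txt convention that large ranges are written as a `<..., First>` line followed by a `<..., Last>` line, and expands those pairs (which also expands the private-use ranges) while skipping surrogates.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class UnicodeDataParser
{
    // Hypothetical helper: returns every code point listed in UnicodeData.txt-style
    // input, expanding "<..., First>" / "<..., Last>" pairs and skipping surrogates.
    public static List<int> ReadAssigned(TextReader reader)
    {
        List<int> result = new List<int>();
        int rangeStart = -1; // start of a pending First/Last range, or -1 if none
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue; // blank or malformed line

            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            string name = fields[1];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint; // remember the start, wait for the Last line
                continue;
            }

            int first = codePoint;
            if (rangeStart >= 0 && name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                first = rangeStart;
            }
            rangeStart = -1;

            for (int cp = first; cp <= codePoint; cp++)
            {
                if (cp < 0xD800 || cp > 0xDFFF) result.Add(cp);
            }
        }
        return result;
    }

    public static void Main(string[] args)
    {
        // Assumption: args[0] is the path to a locally saved copy of UnicodeData.txt.
        using (StreamReader reader = new StreamReader(args[0]))
        {
            Console.WriteLine(ReadAssigned(reader).Count);
        }
    }
}
```

Each returned code point can then be passed through `char.ConvertFromUtf32` and `UTF8Encoding.GetBytes` exactly as in the loop above.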

The set of currently assigned Unicode characters is less than the set that could be defined. Of course, whether you see a character when you print one of them out depends on a great many other factors, like fonts and the other applications it'll pass through before it is emitted to your eyeball.
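
If the goal is to exercise every code point that could be defined rather than only the assigned ones, the counter approach from the question also works. The following is a minimal sketch under that assumption; the names `Utf8Sweep` and `EncodeRange` are invented here. It walks U+0000 through U+10FFFF, skips the surrogate range (which `char.ConvertFromUtf32` rejects), and yields the UTF-8 bytes of each remaining value.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf8Sweep
{
    // Hypothetical helper: yields the UTF-8 bytes of every Unicode scalar value
    // in [first, last], skipping the surrogate range U+D800..U+DFFF.
    public static IEnumerable<byte[]> EncodeRange(int first, int last)
    {
        // No byte order mark; throw instead of silently replacing invalid input.
        UTF8Encoding utf8 = new UTF8Encoding(false, true);
        for (int codePoint = first; codePoint <= last; codePoint++)
        {
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue;
            string utf16 = char.ConvertFromUtf32(codePoint); // one or two chars
            yield return utf8.GetBytes(utf16);               // one to four bytes
        }
    }

    public static void Main()
    {
        int count = 0;
        foreach (byte[] bytes in EncodeRange(0, 0x10FFFF))
        {
            // TODO: feed bytes to the system under test
            count++;
        }
        Console.WriteLine(count); // prints 1112064
    }
}
```

Most of those 1,112,064 values are unassigned, so a system that validates against the assigned set may reject many of them; whether that is useful depends on what is being tested.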

唯我独甜
#3 · 2020-02-09 04:01

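This code writes its output to a file. Every character in the Basic Multilingual Plane (U+0000 through U+FFFF), printable or not, will be in there; characters above U+FFFF are not covered because the loop stops at char.MaxValue.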

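// Requires: using System.IO; using System.Text;
// GetEncoding returns a read-only instance, so clone it before changing the fallback.
Encoding enc = (Encoding)Encoding.GetEncoding("utf-8").Clone();
// An empty replacement string makes unencodable chars (lone surrogates) yield zero bytes.
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\utf-8.txt"))
{
    // char.MaxValue is U+FFFF, so this loop covers only the Basic Multilingual Plane.
    for (int i = 0; i <= char.MaxValue; i++)
    {
        chars[0] = (char)i;
        int count = enc.GetBytes(chars, 0, 1, bytes, 0);

        // Skip code units that could not be encoded.
        if (count != 0)
        {
            sw.WriteLine(chars[0]);
        }
    }
}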
手持菜刀,她持情操
#4 · 2020-02-09 04:06

This will give you all the characters in a charset; just make sure you name the charset you want when calling Encoding.GetEncoding (the sample uses ISO-8859-1 and only scans the values 0 through 9999):

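// Requires .NET 4 or later (ConcurrentBag and Parallel are not available in 3.5), plus:
// using System; using System.Collections.Concurrent; using System.Collections.Generic;
// using System.Linq; using System.Text; using System.Threading.Tasks;
var results = new ConcurrentBag<int> ();
// Ten workers, each testing a block of 1000 values, so 0..9999 in total.
Parallel.For (0, 10, set => {
    var encoding = Encoding.GetEncoding ("ISO-8859-1");
    var c = encoding.GetEncoder ();
    // Throw instead of substituting '?' when a char is not in the charset.
    c.Fallback = new EncoderExceptionFallback ();
    var start = set * 1000;
    var end = start + 1000;
    Console.WriteLine ("Worker #{0}: {1} - {2}", set, start, end);

    char[] input = new char[1];
    byte[] output = new byte[5];
    for (int i = start; i < end; i++) {
        try {
            input[0] = (char)i;
            c.GetBytes (input, 0, 1, output, 0, true);
            results.Add (i);
        }
        catch (EncoderFallbackException) {
            // Not representable in this charset; skip it.
        }
    }
});
var hashSet = new HashSet<int> (results);
//hashSet.Remove ((int)'\r');
//hashSet.Remove ((int)'\n');
var sorted = hashSet.ToArray ();
Array.Sort (sorted);
// All characters the charset can encode, in ascending order.
var charset = new string (sorted.Select (i => (char)i).ToArray ());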