Finding which encodings in .NET are ASCII-compatible

Posted 2019-05-31 17:01

Question:

Is there actually any simple method of finding which encodings in .NET are ASCII-compatible?

(Based on the question posed in Nyerguds's comment.)

Answer 1:

We will assume the standard definition of ASCII that is limited to 128 characters (namely, byte values whose most significant bit is 0). Unicode was designed such that its first 128 code points correspond to their ASCII equivalents. Since the numeric value of the char structure in .NET corresponds to its Unicode code point (except for surrogates), we can define a utility method like so:

// Requires: using System; using System.Linq; using System.Text;
private static readonly byte[] asciiValues =
    Enumerable.Range(0, 128).Select(b => (byte)b).ToArray();

private static readonly string asciiChars = 
    new string(asciiValues.Select(b => (char)b).ToArray());

public static bool IsAsciiCompatible(Encoding encoding)
{
    try
    {
        return encoding.GetString(asciiValues).Equals(asciiChars, StringComparison.Ordinal)
            && encoding.GetBytes(asciiChars).SequenceEqual(asciiValues);
    }
    catch (ArgumentException)
    {
        // Encoding.GetString may throw DecoderFallbackException if a fallback occurred 
        // and DecoderFallback is set to DecoderExceptionFallback.
        // Encoding.GetBytes may throw EncoderFallbackException if a fallback occurred 
        // and EncoderFallback is set to EncoderExceptionFallback.
        // Both of these derive from ArgumentException.
        return false;
    }
}

We could then enumerate all .NET encodings like so:

var encodings = Encoding.GetEncodings().Select(e => e.GetEncoding()).ToList();
var asciiCompatible = encodings.Where(e => IsAsciiCompatible(e)).ToList();
var nonAsciiCompatible = encodings.Except(asciiCompatible).ToList();

Console.WriteLine("ASCII compatible: ");
foreach (var encodingName in asciiCompatible.Select(e => e.EncodingName).OrderBy(n => n))
    Console.WriteLine("* " + encodingName);
Console.WriteLine();
Console.WriteLine("Non-ASCII compatible: ");
foreach (var encodingName in nonAsciiCompatible.Select(e => e.EncodingName).OrderBy(n => n))
    Console.WriteLine("* " + encodingName);
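
One caveat if you run the enumeration outside .NET Framework: on .NET Core and later, only a few encodings are built in, and the legacy code pages come from CodePagesEncodingProvider. The sketch below assumes .NET 5 or later, where (to my understanding) Encoding.GetEncodings() also reports encodings from registered providers; on older .NET Core versions the list may stay short even after registration. EncodingCatalog and GetAllEncodings are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingCatalog
{
    // Hypothetical helper: registers the code-page provider, then
    // materializes every encoding the runtime reports.
    public static Encoding[] GetAllEncodings()
    {
        // Makes legacy code pages (Windows-125x, DOS, EBCDIC, ...) available.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.GetEncodings()
            .Select(info => info.GetEncoding())
            .ToArray();
    }
}
```

The resulting array can be used in place of the encodings list in the snippet above.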

Note that this method is not entirely safe. If there exists a multi-byte encoding that does fancy mappings of consecutive bytes or characters – such as decoding 0x61 to 'a' and 0x62 to 'b' (like in ASCII) but 0x6261 to a single unrelated character – then this test would give incorrect results.
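
To narrow that gap somewhat, one could also round-trip every ordered pair of ASCII values instead of a single ascending run. This is only a sketch: it covers two-byte sequences, not longer ones or stateful encodings, and AsciiPairCheck and IsAsciiCompatibleForPairs are hypothetical names introduced here.

```csharp
using System;
using System.Text;

public static class AsciiPairCheck
{
    // Hypothetical helper: returns true only if every two-byte sequence of
    // ASCII values decodes to the two matching characters and encodes back
    // to the same two bytes.
    public static bool IsAsciiCompatibleForPairs(Encoding encoding)
    {
        var bytes = new byte[2];
        var chars = new char[2];
        try
        {
            for (int first = 0; first < 128; first++)
            {
                for (int second = 0; second < 128; second++)
                {
                    bytes[0] = (byte)first;
                    bytes[1] = (byte)second;
                    chars[0] = (char)first;
                    chars[1] = (char)second;
                    var expected = new string(chars);

                    // Decoding must yield exactly the two ASCII characters.
                    if (!string.Equals(encoding.GetString(bytes), expected, StringComparison.Ordinal))
                        return false;

                    // Encoding must yield exactly the two original bytes.
                    byte[] encoded = encoding.GetBytes(expected);
                    if (encoded.Length != 2 || encoded[0] != bytes[0] || encoded[1] != bytes[1])
                        return false;
                }
            }
            return true;
        }
        catch (ArgumentException)
        {
            // Covers DecoderFallbackException and EncoderFallbackException.
            return false;
        }
    }
}
```

It runs 16,384 round trips per encoding, which is still cheap enough to combine with IsAsciiCompatible when filtering the full list.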

Running this on .NET Fiddle (snippet) gives the following results:

ASCII compatible:

  • Arabic (864)
  • Arabic (ASMO 708)
  • Arabic (DOS)
  • Arabic (ISO)
  • Arabic (Mac)
  • Arabic (Windows)
  • Baltic (DOS)
  • Baltic (ISO)
  • Baltic (Windows)
  • Central European (DOS)
  • Central European (ISO)
  • Central European (Mac)
  • Central European (Windows)
  • Chinese Simplified (EUC)
  • Chinese Simplified (GB18030)
  • Chinese Simplified (GB2312)
  • Chinese Simplified (GB2312-80)
  • Chinese Simplified (ISO-2022)
  • Chinese Simplified (Mac)
  • Chinese Traditional (Big5)
  • Chinese Traditional (CNS)
  • Chinese Traditional (Eten)
  • Chinese Traditional (Mac)
  • Croatian (Mac)
  • Cyrillic (DOS)
  • Cyrillic (ISO)
  • Cyrillic (KOI8-R)
  • Cyrillic (KOI8-U)
  • Cyrillic (Mac)
  • Cyrillic (Windows)
  • Estonian (ISO)
  • French Canadian (DOS)
  • Greek (DOS)
  • Greek (ISO)
  • Greek (Mac)
  • Greek (Windows)
  • Greek, Modern (DOS)
  • Hebrew (DOS)
  • Hebrew (ISO-Logical)
  • Hebrew (ISO-Visual)
  • Hebrew (Mac)
  • Hebrew (Windows)
  • IBM5550 Taiwan
  • Icelandic (DOS)
  • Icelandic (Mac)
  • ISCII Assamese
  • ISCII Bengali
  • ISCII Devanagari
  • ISCII Gujarati
  • ISCII Kannada
  • ISCII Malayalam
  • ISCII Oriya
  • ISCII Punjabi
  • ISCII Tamil
  • ISCII Telugu
  • Japanese (EUC)
  • Japanese (JIS 0208-1990 and 0212-1990)
  • Japanese (Mac)
  • Japanese (Shift-JIS)
  • Korean
  • Korean (EUC)
  • Korean (Johab)
  • Korean (Mac)
  • Korean Wansung
  • Latin 3 (ISO)
  • Latin 9 (ISO)
  • Nordic (DOS)
  • OEM Cyrillic
  • OEM Multilingual Latin I
  • OEM United States
  • Portuguese (DOS)
  • Romanian (Mac)
  • TCA Taiwan
  • TeleText Taiwan
  • Thai (Windows)
  • Turkish (DOS)
  • Turkish (ISO)
  • Turkish (Mac)
  • Turkish (Windows)
  • Ukrainian (Mac)
  • Unicode (UTF-8)
  • US-ASCII
  • Vietnamese (Windows)
  • Wang Taiwan
  • Western European (DOS)
  • Western European (ISO)
  • Western European (Mac)
  • Western European (Windows)

Non-ASCII compatible:

  • Chinese Simplified (HZ)
  • Europa
  • German (IA5)
  • IBM EBCDIC (Arabic)
  • IBM EBCDIC (Cyrillic Russian)
  • IBM EBCDIC (Cyrillic Serbian-Bulgarian)
  • IBM EBCDIC (Denmark-Norway)
  • IBM EBCDIC (Denmark-Norway-Euro)
  • IBM EBCDIC (Finland-Sweden)
  • IBM EBCDIC (Finland-Sweden-Euro)
  • IBM EBCDIC (France)
  • IBM EBCDIC (France-Euro)
  • IBM EBCDIC (Germany)
  • IBM EBCDIC (Germany-Euro)
  • IBM EBCDIC (Greek Modern)
  • IBM EBCDIC (Greek)
  • IBM EBCDIC (Hebrew)
  • IBM EBCDIC (Icelandic)
  • IBM EBCDIC (Icelandic-Euro)
  • IBM EBCDIC (International)
  • IBM EBCDIC (International-Euro)
  • IBM EBCDIC (Italy)
  • IBM EBCDIC (Italy-Euro)
  • IBM EBCDIC (Japanese katakana)
  • IBM EBCDIC (Korean Extended)
  • IBM EBCDIC (Multilingual Latin-2)
  • IBM EBCDIC (Spain)
  • IBM EBCDIC (Spain-Euro)
  • IBM EBCDIC (Thai)
  • IBM EBCDIC (Turkish Latin-5)
  • IBM EBCDIC (Turkish)
  • IBM EBCDIC (UK)
  • IBM EBCDIC (UK-Euro)
  • IBM EBCDIC (US-Canada)
  • IBM EBCDIC (US-Canada-Euro)
  • IBM Latin-1
  • IBM Latin-1
  • ISO-6937
  • Japanese (JIS)
  • Japanese (JIS-Allow 1 byte Kana - SO/SI)
  • Japanese (JIS-Allow 1 byte Kana)
  • Korean (ISO)
  • Norwegian (IA5)
  • Swedish (IA5)
  • T.61
  • Thai (Mac)
  • Unicode (UTF-16)
  • Unicode (Big-Endian)
  • Unicode (UTF-32 Big-Endian)
  • Unicode (UTF-32)
  • Unicode (UTF-7)
  • Western European (IA5)