IMAP folder path encoding (IMAP UTF-7) for .NET?

Posted 2019-05-18 06:56

Question:

The IMAP specification (RFC 2060, 5.1.3. Mailbox International Naming Convention) describes how to handle non-ASCII characters in folder names. It defines a modified UTF-7 encoding:

By convention, international mailbox names are specified using a modified version of the UTF-7 encoding described in [UTF-7]. The purpose of these modifications is to correct the following problems with UTF-7:

  1. UTF-7 uses the "+" character for shifting; this conflicts with the common use of "+" in mailbox names, in particular USENET newsgroup names.

  2. UTF-7's encoding is BASE64 which uses the "/" character; this conflicts with the use of "/" as a popular hierarchy delimiter.

  3. UTF-7 prohibits the unencoded usage of "\"; this conflicts with the use of "\" as a popular hierarchy delimiter.

  4. UTF-7 prohibits the unencoded usage of "~"; this conflicts with the use of "~" in some servers as a home directory indicator.

  5. UTF-7 permits multiple alternate forms to represent the same string; in particular, printable US-ASCII characters can be represented in encoded form.

In modified UTF-7, printable US-ASCII characters except for "&" represent themselves; that is, characters with octet values 0x20-0x25 and 0x27-0x7e. The character "&" (0x26) is represented by the two-octet sequence "&-".

All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all Unicode 16-bit octets) are represented in modified BASE64, with a further modification from [UTF-7] that "," is used instead of "/".
Modified BASE64 MUST NOT be used to represent any printing US-ASCII character which can represent itself.

"&" is used to shift to modified BASE64 and "-" to shift back to US-ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that is, a name that ends with a Unicode 16-bit octet MUST end with a "-").

Before I start implementing it myself, my question: is there some .NET code or library out there (or even in the framework) that does the job? I couldn't find any .NET resources, only implementations for other languages and frameworks.

Thank you!

Answer 1:

This is too specialized to be present in the framework. There might be something on CodePlex, though many incomplete "implementations" I've seen don't bother with the conversion at all and will happily pass all non-US-ASCII characters on to the IMAP server.

However, I've implemented this in the past and it is really just 30 lines of code. You go through all characters in the string and output them as-is if they fall in the range 0x20 to 0x7e (don't forget to append "-" after an "&"). Otherwise, collect the run of non-US-ASCII characters and convert it using big-endian UTF-16 followed by base64 (not UTF-8), replacing "/" with "," and dropping the "=" padding. Additionally, you need to maintain a "shifted state", i.e. whether you're currently encoding non-US-ASCII or outputting US-ASCII, and emit the transition tokens "&" and "-" on each state change.
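The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code I originally wrote: the class name `ImapUtf7Sketch` and the helpers `Encode` and `Flush` are made up for this example, and the "shifted state" is represented simply by whether the pending buffer is non-empty.

```csharp
using System;
using System.Text;

static class ImapUtf7Sketch
{
    // Hypothetical helper: encodes a mailbox name as modified UTF-7.
    public static string Encode(string name)
    {
        var output = new StringBuilder();
        var pending = new StringBuilder(); // characters collected while "shifted"

        foreach (char c in name)
        {
            bool printableAscii = c >= 0x20 && c <= 0x7E;
            if (!printableAscii)
            {
                pending.Append(c); // enter, or stay in, the shifted state
                continue;
            }
            Flush(pending, output); // leave the shifted state if we were in it
            output.Append(c == '&' ? "&-" : c.ToString());
        }
        Flush(pending, output); // a name must always end in US-ASCII
        return output.ToString();
    }

    // Writes "&" + modified base64 + "-" for the pending run, then clears it.
    private static void Flush(StringBuilder pending, StringBuilder output)
    {
        if (pending.Length == 0) return;
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(pending.ToString());
        string base64 = Convert.ToBase64String(utf16).TrimEnd('=').Replace('/', ',');
        output.Append('&').Append(base64).Append('-');
        pending.Clear();
    }
}
```

With this sketch, `ImapUtf7Sketch.Encode("~peter/mail/\u53F0\u5317/\u65E5\u672C\u8A9E")` should yield `~peter/mail/&U,BTFw-/&ZeVnLIqe-`, which is the example mailbox name given in that section of the RFC, and `"Tom & Jerry"` becomes `Tom &- Jerry`.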



Answer 2:

Not tested, but this MIT-licensed code looks fine, provided Aleksey's bugfix is applied:

    // Requires: using System; using System.IO; using System.Text;
    /// <summary>
    /// Takes a UTF-16 encoded string and encodes it as modified UTF-7.
    /// </summary>
    /// <param name="s">The string to encode.</param>
    /// <returns>A UTF-7 encoded string</returns>
    /// <remarks>IMAP uses a modified version of UTF-7 for encoding international mailbox names. For
    /// details, refer to RFC 3501 section 5.1.3 (Mailbox International Naming Convention).</remarks>
    internal static string UTF7Encode(string s) {
        StringReader reader = new StringReader(s);
        StringBuilder builder = new StringBuilder();
        while (reader.Peek() != -1) {
            char c = (char)reader.Read();
            int codepoint = Convert.ToInt32(c);
            // It's a printable ASCII character.
            if (codepoint > 0x1F && codepoint < 0x7F) {
                builder.Append(c == '&' ? "&-" : c.ToString());
            } else {
                // The character sequence needs to be encoded.
                StringBuilder sequence = new StringBuilder(c.ToString());
                while (reader.Peek() != -1) {
                    codepoint = Convert.ToInt32((char)reader.Peek());
                    if (codepoint > 0x1F && codepoint < 0x7F)
                        break;
                    sequence.Append((char)reader.Read());
                }
                byte[] buffer = Encoding.BigEndianUnicode.GetBytes(
                    sequence.ToString());
                string encoded = Convert.ToBase64String(buffer).Replace('/', ',').
                    TrimEnd('=');
                builder.Append("&" + encoded + "-");
            }
        }
        return builder.ToString();
    }

    /// <summary>
    /// Takes a modified UTF-7 encoded string and decodes it.
    /// </summary>
    /// <param name="s">The UTF-7 encoded string to decode.</param>
    /// <returns>A UTF-16 encoded "standard" C# string</returns>
    /// <exception cref="FormatException">The input string is not a properly UTF-7 encoded
    /// string.</exception>
    /// <remarks>IMAP uses a modified version of UTF-7 for encoding international mailbox names. For
    /// details, refer to RFC 3501 section 5.1.3 (Mailbox International Naming Convention).</remarks>
    internal static string UTF7Decode(string s) {
        StringReader reader = new StringReader(s);
        StringBuilder builder = new StringBuilder();
        while (reader.Peek() != -1) {
            char c = (char)reader.Read();
            if (c == '&' && reader.Peek() != '-') {
                // The character sequence needs to be decoded.
                StringBuilder sequence = new StringBuilder();
                while (reader.Peek() != -1) {
                    if ((c = (char)reader.Read()) == '-')
                        break;
                    sequence.Append(c);
                }
                string encoded = sequence.ToString().Replace(',', '/');
                int pad = encoded.Length % 4;
                if (pad > 0)
                    encoded = encoded.PadRight(encoded.Length + (4 - pad), '=');
                try {
                    byte[] buffer = Convert.FromBase64String(encoded);
                    builder.Append(Encoding.BigEndianUnicode.GetString(buffer));
                } catch (Exception e) {
                    throw new FormatException(
                        "The input string is not in the correct Format.", e);
                }
            } else {
                if (c == '&' && reader.Peek() == '-')
                    reader.Read();
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

Don't use this code in its current state. It contains [...] UTF7.GetBytes([...]) [...] .Replace('+', '&'), i.e. it uses the existing .NET UTF-7 encoding routine and (among other things) replaces + with & in the result. This is wrong, because it changes not only the "shift character" from + to & (which is intended and correct) but also every + character inside the base64-encoded regions (which must not be changed to &).
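The effect can be reproduced without the UTF-7 classes at all, since "+" is an ordinary member of the base64 alphabet. Below is a minimal sketch: the class `PlusInBase64Demo` and its methods are hypothetical names for illustration, and the "standard UTF-7" sequence is assembled by hand rather than produced by the framework. It uses U+FB01, whose big-endian UTF-16 bytes (0xFB 0x01) happen to encode to a base64 payload that starts with "+".

```csharp
using System;
using System.Text;

static class PlusInBase64Demo
{
    // Base64 of the big-endian UTF-16 bytes, with the '=' padding removed.
    public static string Payload(string s)
    {
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(s);
        return Convert.ToBase64String(utf16).TrimEnd('=');
    }

    // Correct: only the shift character becomes '&'; '/' becomes ','.
    public static string CorrectShift(string s)
    {
        return "&" + Payload(s).Replace('/', ',') + "-";
    }

    // Wrong: starts from a hand-built standard UTF-7 shape ("+payload-")
    // and blindly replaces every '+', including those inside the payload.
    public static string NaiveReplace(string s)
    {
        string standardShape = "+" + Payload(s) + "-";
        return standardShape.Replace('+', '&').Replace('/', ',');
    }
}
```

For `"\uFB01"` the payload is `+wE`, so `CorrectShift` gives `&+wE-` while `NaiveReplace` gives `&&wE-`, which no longer decodes to the original character.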