The IMAP specification (RFC 2060, 5.1.3. Mailbox International Naming Convention) describes how to handle non-ASCII characters in folder names. It defines a modified UTF-7 encoding:
By convention, international mailbox names are specified using a modified version of the UTF-7 encoding described in [UTF-7]. The purpose of these modifications is to correct the following problems with UTF-7:
1) UTF-7 uses the "+" character for shifting; this conflicts with the common use of "+" in mailbox names, in particular USENET newsgroup names.
2) UTF-7's encoding is BASE64 which uses the "/" character; this conflicts with the use of "/" as a popular hierarchy delimiter.
3) UTF-7 prohibits the unencoded usage of "\"; this conflicts with the use of "\" as a popular hierarchy delimiter.
4) UTF-7 prohibits the unencoded usage of "~"; this conflicts with the use of "~" in some servers as a home directory indicator.
5) UTF-7 permits multiple alternate forms to represent the same string; in particular, printable US-ASCII characters can be represented in encoded form.
In modified UTF-7, printable US-ASCII characters except for "&" represent themselves; that is, characters with octet values 0x20-0x25 and 0x27-0x7e. The character "&" (0x26) is represented by the two-octet sequence "&-".
All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all Unicode 16-bit octets) are represented in modified BASE64, with a further modification from [UTF-7] that "," is used instead of "/". Modified BASE64 MUST NOT be used to represent any printing US-ASCII character which can represent itself.
"&" is used to shift to modified BASE64 and "-" to shift back to US-ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that is, a name that ends with a Unicode 16-bit octet MUST end with a "-").
Before I start implementing it, my question: is there some .NET code or library out there (or even in the framework) that does the job? I couldn't find any .NET resources (only implementations for other languages and frameworks).
Thank you!
This is too specialized to be present in the framework. There might be something on CodePlex, though many incomplete "implementations" I've seen don't bother with the conversion at all and will happily pass all non-US-ASCII characters on to the IMAP server.
However, I've implemented this in the past and it is really just about 30 lines of code. You go through all characters in the string and output them unchanged if they fall in the range 0x20 to 0x7e (don't forget to append "-" after a literal "&"). Otherwise you collect the run of non-US-ASCII characters and base64-encode its UTF-16 big-endian bytes (not UTF-8; the RFC speaks of Unicode 16-bit octets), replacing "/" with "," and dropping the "=" padding. Additionally, you need to maintain a "shifted state", i.e. whether you're currently encoding non-US-ASCII or outputting US-ASCII, and emit the transition tokens "&" and "-" on each state change.
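The loop described above could be sketched roughly as follows. This is only a minimal illustration, not a tested library: the class name `ImapUtf7Sketch` and the helper `Flush` are made up for this example, and it assumes that each run of non-printable or non-ASCII characters is encoded as UTF-16 big-endian bytes in unpadded base64, which is how I read the RFC text quoted in the question.

```csharp
using System;
using System.Text;

// Hypothetical name, used for illustration only.
static class ImapUtf7Sketch
{
    public static string Encode(string name)
    {
        StringBuilder output = new StringBuilder();
        // Run of characters waiting to be Base64-encoded (the "shifted" state).
        StringBuilder pending = new StringBuilder();

        foreach (char c in name)
        {
            bool printableAscii = c >= 0x20 && c <= 0x7E;
            if (!printableAscii)
            {
                pending.Append(c);      // stay in (or enter) shifted state
                continue;
            }
            Flush(pending, output);     // leave shifted state, if we were in it
            output.Append(c == '&' ? "&-" : c.ToString());
        }
        Flush(pending, output);         // a name must end in US-ASCII state
        return output.ToString();
    }

    // Writes "&" + modified Base64 + "-" for the pending run, then clears it.
    static void Flush(StringBuilder pending, StringBuilder output)
    {
        if (pending.Length == 0)
            return;
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(pending.ToString());
        string base64 = Convert.ToBase64String(utf16).TrimEnd('=').Replace('/', ',');
        output.Append('&').Append(base64).Append('-');
        pending.Clear();
    }
}
```

With this sketch, `ImapUtf7Sketch.Encode("Tom & Jerry")` gives `Tom &- Jerry`, and `ImapUtf7Sketch.Encode("~peter/mail/\u53F0\u5317/\u65E5\u672C\u8A9E")` gives `~peter/mail/&U,BTFw-/&ZeVnLIqe-`, which, as far as I can tell, is the example mailbox name given in RFC 3501 section 5.1.3.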
Not tested, but this MIT-licensed code looks fine, provided Aleksey's bugfix is applied:
/// <summary>
/// Takes a UTF-16 encoded string and encodes it as modified UTF-7.
/// </summary>
/// <param name="s">The string to encode.</param>
/// <returns>A UTF-7 encoded string</returns>
/// <remarks>IMAP uses a modified version of UTF-7 for encoding international mailbox names. For
/// details, refer to RFC 3501 section 5.1.3 (Mailbox International Naming Convention).</remarks>
internal static string UTF7Encode(string s) {
    StringReader reader = new StringReader(s);
    StringBuilder builder = new StringBuilder();
    while (reader.Peek() != -1) {
        char c = (char)reader.Read();
        int codepoint = Convert.ToInt32(c);
        // It's a printable ASCII character.
        if (codepoint > 0x1F && codepoint < 0x7F) {
            builder.Append(c == '&' ? "&-" : c.ToString());
        } else {
            // The character sequence needs to be encoded.
            StringBuilder sequence = new StringBuilder(c.ToString());
            while (reader.Peek() != -1) {
                codepoint = Convert.ToInt32((char)reader.Peek());
                if (codepoint > 0x1F && codepoint < 0x7F)
                    break;
                sequence.Append((char)reader.Read());
            }
            byte[] buffer = Encoding.BigEndianUnicode.GetBytes(
                sequence.ToString());
            string encoded = Convert.ToBase64String(buffer).Replace('/', ',').
                TrimEnd('=');
            builder.Append("&" + encoded + "-");
        }
    }
    return builder.ToString();
}
/// <summary>
/// Takes a modified UTF-7 encoded string and decodes it.
/// </summary>
/// <param name="s">The UTF-7 encoded string to decode.</param>
/// <returns>A UTF-16 encoded "standard" C# string</returns>
/// <exception cref="FormatException">The input string is not a properly UTF-7 encoded
/// string.</exception>
/// <remarks>IMAP uses a modified version of UTF-7 for encoding international mailbox names. For
/// details, refer to RFC 3501 section 5.1.3 (Mailbox International Naming Convention).</remarks>
internal static string UTF7Decode(string s) {
    StringReader reader = new StringReader(s);
    StringBuilder builder = new StringBuilder();
    while (reader.Peek() != -1) {
        char c = (char)reader.Read();
        if (c == '&' && reader.Peek() != '-') {
            // The character sequence needs to be decoded.
            StringBuilder sequence = new StringBuilder();
            while (reader.Peek() != -1) {
                if ((c = (char)reader.Read()) == '-')
                    break;
                sequence.Append(c);
            }
            string encoded = sequence.ToString().Replace(',', '/');
            int pad = encoded.Length % 4;
            if (pad > 0)
                encoded = encoded.PadRight(encoded.Length + (4 - pad), '=');
            try {
                byte[] buffer = Convert.FromBase64String(encoded);
                builder.Append(Encoding.BigEndianUnicode.GetString(buffer));
            } catch (Exception e) {
                throw new FormatException(
                    "The input string is not in the correct Format.", e);
            }
        } else {
            if (c == '&' && reader.Peek() == '-')
                reader.Read();
            builder.Append(c);
        }
    }
    return builder.ToString();
}
Don't use this code in its current state: it contains [...] UTF7.GetBytes([...]) [...] .Replace('+', '&'), i.e. it uses the existing .NET UTF-7 encoding routine and (among other things) replaces + with & in the result. This is wrong, because it changes not only the "shift character" from + to & (which is intended and correct) but also every + char inside the base64-encoded regions (which must not be changed to &).
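To make the problem concrete, here is a small sketch. It does not call the framework's UTF-7 encoder at all; instead it builds a single standard UTF-7 run by hand ("+", then unpadded base64 of the UTF-16 big-endian bytes, then "-"), which is an assumption about what such a run looks like for one non-ASCII character. The class and method names are hypothetical. The character U+4FE1 is chosen because its base64 form happens to contain a "+".

```csharp
using System;
using System.Text;

// Hypothetical names, used for illustration only.
static class ShiftCharPitfall
{
    // Hand-built standard UTF-7 run: "+" + unpadded Base64 of UTF-16BE + "-".
    public static string StandardUtf7Run(string s)
    {
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(s);
        return "+" + Convert.ToBase64String(utf16).TrimEnd('=') + "-";
    }

    // The blanket replacement criticised above: it also rewrites every '+'
    // that is part of the Base64 data, not just the shift character.
    public static string NaiveConvert(string utf7)
    {
        return utf7.Replace('+', '&').Replace('/', ',');
    }

    // Shift-aware conversion of ONE run of the form "+<base64>-":
    // only the leading shift character becomes '&'; inside the run
    // only '/' is replaced (by ','), and '+' is left alone.
    public static string ConvertRun(string utf7Run)
    {
        return "&" + utf7Run.Substring(1).Replace('/', ',');
    }
}
```

For `"\u4FE1"` the hand-built run is `+T+E-`. `NaiveConvert` turns it into `&T&E-`, which is no longer valid modified base64, while `ConvertRun` yields `&T+E-`. A real converter would of course have to walk the whole string and track where each shifted run starts and ends; this only shows the single-run case.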