We have a problem with the encoding of file names inside a zip file.
We are using the Ionic zip library to compress and decompress archives.
We are located in Denmark, so we often have files containing æ, ø or å in the file names.
When a user compresses files with the Windows built-in tool, I found that it uses the IBM437 encoding, which gave some funky results for files with 'ø' / 'Ø' in their names. I fixed this with the following code:
public static string IBM437Encode(this string text)
{
    return text.Replace('ø', '¢').Replace('Ø', '¥');
}
public static string IBM437Decode(this string text)
{
    return text.Replace('¢', 'ø').Replace('¥', 'Ø');
}
This has been running for some time now, and all has been fine.
But, because there's always a but, we didn't try it with a file compressed with the default tool in Mac OS X.
So now we have a new problem.
There, file names containing æ, ø and å are encoded as UTF-8!
So I can get it to work if I know where the zip has been compressed, but is there any easy way to detect or normalize the encoding inside a zip?
Detecting encoding is always a tricky business, but UTF8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect:
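// Requires: using System; using System.Text;
public static Boolean MatchesUtf8Encoding(Byte[] bytes)
{
    // First argument: no byte order mark. Second argument: throw on invalid byte sequences.
    UTF8Encoding enc = new UTF8Encoding(false, true);
    try { enc.GetString(bytes); }
    catch (ArgumentException) { return false; }
    return true;
}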
If you run that over all file names in a zip, you can determine whether it fails anywhere, in which case you can conclude the names are not saved as UTF-8.
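As a rough sketch of that idea, the helper below checks a whole set of names and only picks UTF-8 if every one of them validates. The class and method names are made up for illustration, and it assumes you can get at the raw, undecoded name bytes. How you do that depends on the zip library; if the library already handed you strings decoded as IBM437, re-encoding them with that same encoding should give the original bytes back. The fallback encoding is supplied by the caller. Note that on .NET Core and later, Encoding.GetEncoding(437) only works after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ZipNameDecoder
{
    // Strict UTF-8: no byte order mark, throw on invalid byte sequences.
    private static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // True only if every raw name is a valid UTF-8 byte sequence.
    // Pure ASCII names are valid UTF-8 too, so they never cause a failure.
    public static bool AllNamesAreUtf8(IEnumerable<byte[]> rawNames)
    {
        foreach (byte[] raw in rawNames)
        {
            try { StrictUtf8.GetString(raw); }
            catch (ArgumentException) { return false; }
        }
        return true;
    }

    // Decodes every name as UTF-8 if they all validate,
    // otherwise decodes every name with the caller's fallback encoding.
    public static List<string> DecodeNames(IReadOnlyList<byte[]> rawNames, Encoding fallback)
    {
        Encoding chosen = AllNamesAreUtf8(rawNames) ? (Encoding)StrictUtf8 : fallback;
        var result = new List<string>();
        foreach (byte[] raw in rawNames)
        {
            result.Add(chosen.GetString(raw));
        }
        return result;
    }
}
```

Deciding per archive rather than per name is a deliberate choice here: a single tool normally writes all names in one encoding, so one invalid name is taken as evidence against UTF-8 for the whole zip.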
Do note that besides UTF-8 there's also the annoying difference between the computer's default encoding (Encoding.Default on .NET Framework, usually Windows-1252 in the US and Western European countries, but annoyingly different depending on region and language) and the DOS-437 encoding you already encountered.
Making the distinction between those is very, very hard, and would probably need to be done by actually checking for each encoding which ranges beyond byte 0x80 produce normal accented characters, and which are special characters you generally won't expect to encounter in a file name. For example, a lot of the DOS-437 characters are frames that were used to draw semi-graphical user interfaces in DOS.
For reference, these are the special characters (so the byte range 0x80-0xFF) in DOS-437:
80 ÇüéâäàåçêëèïîìÄÅ
90 ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
A0 áíóúñѪº¿⌐¬½¼¡«»
B0 ░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
C0 └┴┬├─┼╞╟╚╔╩╦╠═╬╧
D0 ╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
E0 αßΓπΣσµτΦΘΩδ∞φε∩
F0 ≡±≥≤⌠⌡÷≈°∙·√ⁿ²■
And in Windows-1252:
80 €�‚ƒ„…†‡ˆ‰Š‹Œ�Ž�
90 �‘’“”•–—˜™š›œ�žŸ
A0 ¡¢£¤¥¦§¨©ª«¬�®¯
B0 °±²³´µ¶·¸¹º»¼½¾¿
C0 ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
D0 ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
E0 àáâãäåæçèéêëìíîï
F0 ðñòóôõö÷øùúûüýþÿ
Some of these aren't even printable, so that makes it a bit easier.
As you see, generally, DOS-437 has most of its accented characters in the 0x80-0xA5 region (with the Beta at 0xE1 often used in Germany as eszett), whereas Win-1252 has practically all of them in the region 0xC0-0xFF. If you determine these regions you can make a scan mechanism that evaluates which encoding it seems to lean towards, simply by counting how many fall inside and outside the expected ranges for each.
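A minimal sketch of such a scan, using exactly the ranges mentioned above. The type and method names are hypothetical, and excluding × (0xD7) and ÷ (0xF7) from the Win-1252 letter range is my own refinement. It only looks at bytes of 0x80 and above and reports which of the two encodings has more bytes in its "accented letter" region:

```csharp
using System;

public enum LegacyGuess { Unknown, Dos437, Windows1252 }

public static class CodePageGuesser
{
    // DOS-437: accented letters sit in 0x80-0xA5, plus ß at 0xE1.
    private static bool IsLikelyDos437Letter(byte b)
    {
        return (b >= 0x80 && b <= 0xA5) || b == 0xE1;
    }

    // Windows-1252: accented letters sit in 0xC0-0xFF, except × (0xD7) and ÷ (0xF7).
    private static bool IsLikelyWin1252Letter(byte b)
    {
        return b >= 0xC0 && b != 0xD7 && b != 0xF7;
    }

    // Counts, per encoding, how many high bytes land in its letter region
    // and returns the encoding with the higher count (Unknown on a tie).
    public static LegacyGuess Guess(byte[] bytes)
    {
        int dos = 0, win = 0;
        foreach (byte b in bytes)
        {
            if (b < 0x80) continue;              // ASCII is identical in both
            if (IsLikelyDos437Letter(b)) dos++;
            if (IsLikelyWin1252Letter(b)) win++;
        }
        if (dos > win) return LegacyGuess.Dos437;
        if (win > dos) return LegacyGuess.Windows1252;
        return LegacyGuess.Unknown;
    }
}
```

For example, "blåbær" is 62 6C 86 62 91 72 in DOS-437 and 62 6C E5 62 E6 72 in Windows-1252, so the first byte sequence scores for DOS-437 and the second for Windows-1252. Running it over the concatenated bytes of all names in the archive gives it more evidence to work with than a single name.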
Note that Char in C# represents a Unicode character, no matter what it was loaded from as bytes, and Unicode characters have certain classifications you can look up programmatically that distinguish them between normal letters (possibly with diacritics) and various classes of special characters (simple example: I know one of these classes is "whitespace characters"). It may be worth looking into that system to automate the process of determining what "normal language characters" are.
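One possible way to use those classifications, as a sketch only: decode the name with each candidate encoding, then score each result by how many of its non-ASCII characters Unicode considers letters. The names below are hypothetical; char.IsLetter is the standard framework check for the letter categories.

```csharp
using System;

public static class NameScorer
{
    // Fraction of the non-ASCII characters in an already decoded name
    // that Unicode classifies as letters.
    // Returns 1.0 when the name has no non-ASCII characters at all.
    public static double NonAsciiLetterRatio(string decoded)
    {
        int total = 0, letters = 0;
        foreach (char c in decoded)
        {
            if (c < 0x80) continue;          // plain ASCII tells us nothing
            total++;
            if (char.IsLetter(c)) letters++;
        }
        return total == 0 ? 1.0 : (double)letters / total;
    }
}
```

You would call this once per candidate decoding of the same bytes and keep the candidate with the highest ratio. Be aware that it is weaker than a hand-tuned range check in one respect: the Greek letters in the DOS-437 0xE0 row count as letters too, so they will not be penalised.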