I am trying to detect which character encoding is used in my file.
I tried this code to get the standard encoding:
public static Encoding GetFileEncoding(string srcFile)
{
    // *** Use Default of Encoding.Default (Ansi CodePage)
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
    {
        file.Read(buffer, 0, 5);
    }

    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;                    // EF BB BF: UTF-8
    else if (buffer[0] == 0xff && buffer[1] == 0xfe && buffer[2] == 0 && buffer[3] == 0)
        enc = Encoding.UTF32;                   // FF FE 00 00: UTF-32 LE (check before UTF-16 LE)
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.GetEncoding(12001);      // 00 00 FE FF: UTF-32 BE
    else if (buffer[0] == 0xff && buffer[1] == 0xfe)
        enc = Encoding.Unicode;                 // FF FE: UTF-16 LE (code page 1200)
    else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.BigEndianUnicode;        // FE FF: UTF-16 BE (code page 1201)
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;                    // 2B 2F 76: UTF-7
    return enc;
}
The first five bytes of my file are 60, 118, 56, 46 and 49.
Is there a chart that shows which encoding matches those first five bytes?
You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.
UTF-32
BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).
But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
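A minimal sketch of that check (the method name and structure are mine, not a standard API; swap the offsets to test little-endian):

// Sketch: true if every 4-byte unit matches the UTF-32BE pattern 00 {00-10} xx xx.
// For UTF-32LE, test data[i + 3] == 0x00 and data[i + 2] <= 0x10 instead.
static bool LooksLikeUtf32BigEndian(byte[] data)
{
    if (data.Length == 0 || data.Length % 4 != 0)
        return false;
    for (int i = 0; i < data.Length; i += 4)
    {
        // Code points never exceed U+10FFFF, so the high byte must be 00
        // and the plane byte at most 10.
        if (data[i] != 0x00 || data[i + 1] > 0x10)
            return false;
    }
    return true;
}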
US-ASCII
No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
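For example, a trivial sketch:

// Sketch: data is plain ASCII iff no byte has the high bit set.
static bool IsAscii(byte[] data)
{
    foreach (byte b in data)
        if (b >= 0x80)
            return false;
    return true;
}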
UTF-8
BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.
But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.
Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
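In .NET you can perform this validation by decoding with throwOnInvalidBytes set. A sketch (requires using System.Text;):

// Sketch: strict validation - UTF8Encoding with throwOnInvalidBytes: true
// throws DecoderFallbackException on any malformed sequence.
static bool IsValidUtf8(byte[] data)
{
    try
    {
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true).GetString(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}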
UTF-16
BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.
If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.
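A rough sketch of that zero-byte heuristic (the 40% threshold is my own arbitrary choice, not a standard value):

// Sketch: for mostly-Latin text, UTF-16 puts a 00 byte in every other position.
// Returns null when the distribution is inconclusive.
static Encoding GuessUtf16(byte[] data)
{
    int zerosAtEven = 0, zerosAtOdd = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] != 0x00) continue;
        if (i % 2 == 0) zerosAtEven++; else zerosAtOdd++;
    }
    double threshold = 0.4 * data.Length;
    if (zerosAtEven > threshold) return Encoding.BigEndianUnicode; // 00 xx units
    if (zerosAtOdd > threshold) return Encoding.Unicode;           // xx 00 units
    return null;
}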
Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.
XML
If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, use that encoding. If absent, assume UTF-8, which is the default XML encoding. If you need to support EBCDIC, also look for the equivalent EBCDIC sequence 4C 6F A7 94 93.
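A simplified sketch of that lookup (it reads the declaration as ASCII and ignores the UTF-16/UTF-32/EBCDIC cases a full implementation would handle):

// Sketch: sniff encoding="..." from an ASCII-compatible XML declaration.
static Encoding SniffXmlEncoding(byte[] data)
{
    string head = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 100));
    if (!head.StartsWith("<?xml"))
        return null; // not XML
    var match = System.Text.RegularExpressions.Regex.Match(
        head, "encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");
    return match.Success
        ? Encoding.GetEncoding(match.Groups[1].Value)
        : Encoding.UTF8; // XML's default when no declaration is present
}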
In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.
None of the above
There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.
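For example, one such port is the Ude NuGet package; a sketch, assuming its CharsetDetector API:

// Sketch using the Ude package (a C# port of Mozilla's detector).
// Charset is an encoding name such as "windows-1252"; Confidence is 0.0-1.0.
byte[] data = File.ReadAllBytes(srcFile);
var detector = new Ude.CharsetDetector();
detector.Feed(data, 0, data.Length);
detector.DataEnd();
if (detector.Charset != null)
    Console.WriteLine($"{detector.Charset} ({detector.Confidence:P0} confidence)");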
A reasonable default
If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.
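Note that on .NET Core and .NET 5+, legacy code pages such as Windows-1252 require registering the provider from the System.Text.Encoding.CodePages package first:

// On .NET Core / .NET 5+, register the code-pages provider before
// requesting legacy encodings such as Windows-1252.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding fallback = Encoding.GetEncoding(1252); // Windows-1252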
If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without a BOM) or any of the single-byte encodings like ASCII, ANSI, ISO-8859-1, etc.
Yes, there is one here: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding.
If you want to pursue a "simple" solution, you might find this class I put together useful:
http://www.architectshack.com/TextFileEncodingDetector.ashx
It does the BOM detection automatically first, and then tries to differentiate between BOM-less Unicode encodings and some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .NET).
As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, the best approach is to provide some form of interactivity with the user if at all possible, because there simply is no 100% detection rate possible!
Use StreamReader and direct it to detect the encoding for you, then use the Code Page Identifiers listed at https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx to switch logic depending on the result.
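A minimal sketch (Peek() is needed so the reader actually inspects the stream before you query CurrentEncoding):

// Sketch: StreamReader detects BOM-carrying encodings automatically when
// detectEncodingFromByteOrderMarks is true; the fallback encoding applies otherwise.
using (var reader = new StreamReader(srcFile, Encoding.Default,
                                     detectEncodingFromByteOrderMarks: true))
{
    reader.Peek(); // forces the preamble to be read
    Encoding detected = reader.CurrentEncoding;
    int codePage = detected.CodePage; // e.g. 65001 = UTF-8, 1200 = UTF-16 LE
}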
You should read this: How can I detect the encoding/codepage of a text file