I have a set of Markdown files to be passed to a Jekyll project, and I need to find their encoding format, i.e. UTF-8 with BOM, UTF-8 without BOM, or ANSI, using a program or an API.
If I pass the location of the files, the files have to be listed and read, and the encoding of each one should be produced as the result.
Is there any code or API for this?
I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in Effective way to find any file's Encoding, but the result differs from what Notepad++ reports.
I also tried https://github.com/errepi/ude (the Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by referencing ude.dll in the C# project, but the result is not reliable either: Notepad++ shows the file encoding as UTF-8, while the program reports it as UTF-8 with BOM.
Both ways should give the same result, so where is the problem?
Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as a byte array, just use the GetPreamble() function of the encoding objects. This allows you to detect a whole range of encodings by preamble.
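For illustration (this quick snippet is mine, not part of the detection code below), these are the preamble bytes the common Unicode encodings report:

    // Requires using System and System.Text.
    // GetPreamble() returns the BOM bytes each encoding writes at the start of a file.
    Console.WriteLine(BitConverter.ToString(new UTF8Encoding(true).GetPreamble()));    // EF-BB-BF
    Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE (UTF-16 LE)
    Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF (UTF-16 BE)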
Now, as for detecting UTF-8 without a preamble, that's not very hard either: UTF-8 has strict bitwise rules about which byte values are allowed in a valid sequence, and you can initialize a UTF8Encoding object so that it fails, by throwing an exception, when those rules are violated.
So if you first do the BOM check, then the strict decoding check, and finally fall back to Windows-1252 (what you call "ANSI"), your detection is done.
    // Requires using System, System.IO, System.Linq and System.Text.
    Byte[] bytes = File.ReadAllBytes(filename);
    Encoding encoding = null;
    String text = null;

    // Test UTF-8 with BOM. This check can easily be copied and adapted
    // to detect many other encodings that use BOMs.
    UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
    Boolean couldBeUtf8 = true;
    Byte[] preamble = encUtf8Bom.GetPreamble();
    Int32 prLen = preamble.Length;
    if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
    {
        // UTF-8 BOM found; use encUtf8Bom to decode.
        try
        {
            // Seems that despite being an encoding with preamble,
            // it doesn't actually skip said preamble when decoding,
            // so skip it manually by giving an offset.
            text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
            encoding = encUtf8Bom;
        }
        catch (ArgumentException)
        {
            // Confirmed as not UTF-8! The strict decoder throws a
            // DecoderFallbackException, which derives from ArgumentException.
            couldBeUtf8 = false;
        }
    }
    // Use the boolean to skip this if the UTF-8 decoding already failed above.
    if (couldBeUtf8 && encoding == null)
    {
        // Test UTF-8 on strict decoding rules. Note that on pure ASCII this will
        // succeed as well, since valid ASCII is automatically valid UTF-8.
        UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
        try
        {
            text = encUtf8NoBom.GetString(bytes);
            encoding = encUtf8NoBom;
        }
        catch (ArgumentException)
        {
            // Confirmed as not UTF-8!
        }
    }
    // Fall back to the default ANSI encoding.
    if (encoding == null)
    {
        encoding = Encoding.GetEncoding(1252);
        text = encoding.GetString(bytes);
    }
Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding in which every byte decodes to a technically valid character, so unless you resort to heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.
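Since the question also asks for listing the files in a folder and reporting each one's encoding, here is a minimal sketch of how the logic above could be wrapped for that. DetectEncoding and EncodingScanner are hypothetical names I introduced for the example; they are not part of any existing API:

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;

    static class EncodingScanner
    {
        // Hypothetical wrapper around the detection logic shown above.
        // Returns UTF-8 (with or without BOM) or falls back to Windows-1252.
        static Encoding DetectEncoding(Byte[] bytes)
        {
            UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
            Byte[] preamble = encUtf8Bom.GetPreamble();
            Int32 prLen = preamble.Length;
            if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
            {
                try
                {
                    encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
                    return encUtf8Bom; // valid UTF-8 with BOM
                }
                catch (ArgumentException)
                {
                    return Encoding.GetEncoding(1252); // BOM-like bytes, but not valid UTF-8
                }
            }
            try
            {
                UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
                encUtf8NoBom.GetString(bytes);
                return encUtf8NoBom; // valid UTF-8 without BOM (or plain ASCII)
            }
            catch (ArgumentException)
            {
                return Encoding.GetEncoding(1252); // fall back to ANSI
            }
        }

        static void Main(String[] args)
        {
            // List every Markdown file in the given folder and print its detected encoding.
            foreach (String file in Directory.EnumerateFiles(args[0], "*.md"))
            {
                Encoding enc = DetectEncoding(File.ReadAllBytes(file));
                String label = enc is UTF8Encoding
                    ? (enc.GetPreamble().Length > 0 ? "UTF-8 with BOM" : "UTF-8 without BOM")
                    : "ANSI (Windows-1252)";
                Console.WriteLine("{0}: {1}", Path.GetFileName(file), label);
            }
        }
    }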