I'm using the StreamReader class in .NET like this:
using (StreamReader reader = new StreamReader(@"c:\somefile.html", true))
{
    string filetext = reader.ReadToEnd();
}
This works fine when the file has a BOM. I ran into trouble with a file that has no BOM: basically I got gibberish. When I specified Encoding.Unicode it worked fine, e.g.:
using (StreamReader reader = new StreamReader(@"c:\somefile.html", Encoding.Unicode, false))
{
    string filetext = reader.ReadToEnd();
}
So, I need to get the file contents into a string. How do people usually handle this? I know there's no solution that will work 100% of the time, but I'd like to improve my odds; there is obviously software out there that tries to guess (e.g. Notepad, browsers, etc.). Is there a method in the .NET Framework that will guess for me? Does anyone have some code they'd like to share?
More background: this question is pretty much the same as mine, but I'm in .NET land. That question led me to a blog listing various encoding-detection libraries, but none of them are in .NET.
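For reference, the BOM detection that StreamReader performs when detectEncodingFromByteOrderMarks is true boils down to comparing the first few bytes of the file against the known BOM signatures. A minimal sketch of that check (in Python for brevity; the function name is my own, not part of any framework):

```python
import codecs

# Known BOM signatures. UTF-32-LE must be tested before UTF-16-LE,
# because its BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

If this returns None, you are back to guessing, which is exactly the situation the question describes.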
I used this to do something similar a while back:
http://www.conceptdevelopment.net/Localization/NCharDet/
See my (recent) answer to this (as far as I can tell, equivalent) question: How can I detect the encoding/codepage of a text file
It does NOT attempt to guess across a range of possible "national" encodings the way MLang and NCharDet do, but instead assumes you know which non-Unicode encodings you're likely to encounter. As far as I can tell from your question, it should address your problem pretty reliably (without relying on the "black box" of MLang).
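The "assume you know which encodings you're likely to encounter" idea can be sketched as trying each candidate in priority order with strict decoding and taking the first that succeeds (Python for illustration; the function name and candidate list are my own choices, not from the linked answer):

```python
def first_that_decodes(raw: bytes, candidates=("utf-8", "utf-16-le", "cp1252")):
    """Return (encoding, text) for the first candidate that decodes strictly."""
    for enc in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on any invalid sequence
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")
```

Order matters: put the strictest encodings first, since a permissive single-byte encoding like cp1252 will accept almost any input and should only be the last resort.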
Library: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
And perhaps a useful thread on Stack Overflow.
You should read this article by Raymond Chen. He goes into detail on how programs can guess what an encoding is (and some of the fun that comes from guessing):
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
I had good luck with Ude, a C# port of the Mozilla Universal Charset Detector.

UTF-8 is designed in such a way that text encoded in an arbitrary 8-bit encoding like Latin-1 is very unlikely to also be a valid UTF-8 byte sequence, so a strict UTF-8 decode that succeeds is strong evidence the text really is UTF-8.
So the minimum approach is this (pseudocode, I don't talk .NET):
try:
    u = some_text.decode("UTF-8")
except UnicodeDecodeError:
    u = some_text.decode("most-likely-encoding")
For the most-likely-encoding one usually uses e.g. Latin-1 or cp1252 or whatever. More sophisticated approaches might try to find language-specific character pairings, but I'm not aware of anything that does that as a library.
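Filling in the pseudocode above, a runnable version might look like this (the function name and the cp1252 default are my own choices):

```python
def decode_with_fallback(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; fall back to a likely single-byte encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 has a handful of unassigned bytes, so use errors="replace"
        # to guarantee this branch never raises.
        return raw.decode(fallback, errors="replace")
```

Because almost no Latin-1/cp1252 text containing accented characters is valid UTF-8, the fallback branch is reached exactly when the UTF-8 guess is wrong.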