How to guess the encoding of a file with no BOM in - Answers, page 2

I'm using the StreamReader class in .NET like this:

// true = detect the encoding from the byte order mark (BOM), if present
using (StreamReader reader = new StreamReader(@"c:\somefile.html", true)) {
    string filetext = reader.ReadToEnd();
}

This works fine when the file has a BOM. I ran into trouble with a file that has no BOM: basically, I got gibberish. When I specified Encoding.Unicode it worked fine, e.g.:

// Force UTF-16 little-endian and turn BOM detection off
using (StreamReader reader = new StreamReader(@"c:\somefile.html", Encoding.Unicode, false)) {
    string filetext = reader.ReadToEnd();
}

So, I need to get the file contents into a string. How do people usually handle this? I know there's no solution that will work 100% of the time, but I'd like to improve my odds. There is obviously software out there that tries to guess (e.g. Notepad, browsers). Is there a method in the .NET Framework that will guess for me? Does anyone have some code they'd like to share?

More background: This question is pretty much the same as mine, but I'm in .NET land. That question led me to a blog listing various encoding detection libraries, but none of them are for .NET.

Tags: c# .net unicode encoding character-encoding

8 answers

Root（大扎）

Reply #2 · 2019-01-22 17:02

A hacky technique might be to take an MD5 of the file's bytes, then decode them and re-encode the result in each candidate encoding, taking an MD5 of each output. If one matches the original hash, you guess it's that encoding.

That's obviously too slow for something that handles a lot of files, but for something like a text editor I could see it working.
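A minimal sketch of that round-trip idea, under a few assumptions: it compares the byte arrays directly instead of hashing them (MD5 would give the same verdict here), and the names EncodingGuesser and RoundTripCandidates are made up for illustration. An encoding that fails the round trip can be ruled out. One that passes is only a candidate, because UTF-16 and most single-byte code pages will round-trip almost any input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of the .NET Framework.
static class EncodingGuesser
{
    // Returns the candidate encodings for which decoding the raw bytes and
    // re-encoding the resulting string reproduces the raw bytes exactly.
    public static List<Encoding> RoundTripCandidates(byte[] raw, IEnumerable<Encoding> candidates)
    {
        var matches = new List<Encoding>();
        foreach (Encoding enc in candidates)
        {
            // With the default fallback, invalid byte sequences decode to U+FFFD,
            // which re-encodes to different bytes and so fails the comparison.
            string decoded = enc.GetString(raw);
            byte[] reencoded = enc.GetBytes(decoded);
            if (raw.SequenceEqual(reencoded))
            {
                matches.Add(enc);
            }
        }
        return matches;
    }
}
```

A caller might pass File.ReadAllBytes(path) together with a short list such as new UTF8Encoding(false) and Encoding.Unicode, then decode with the first surviving candidate or fall back to a default when none survive.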

Other than that, it means getting your hands dirty porting the Java libraries from this post (which came from the Delphi SO question), or using the IE MLang feature.


迷人小祖宗

Reply #3 · 2019-01-22 17:05

Use Win32's IsTextUnicode.

In general, it is a difficult problem. See: http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx
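For reference, calling IsTextUnicode from C# might look like the sketch below. It assumes Windows (the function is exported by Advapi32.dll) and passes a null lpiResult, which asks the function to run all of its tests. The wrapper names Win32TextProbe and LooksLikeUtf16 are hypothetical. The return value is a statistical guess about whether the buffer holds UTF-16 text, not a guarantee.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around the Win32 API; Windows only.
static class Win32TextProbe
{
    // Native signature: BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult);
    [DllImport("advapi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool IsTextUnicode(byte[] lpv, int iSize, IntPtr lpiResult);

    // Returns true when the Win32 heuristics think the buffer is UTF-16 text.
    public static bool LooksLikeUtf16(byte[] buffer)
    {
        // IntPtr.Zero (NULL) means "apply all available tests".
        return IsTextUnicode(buffer, buffer.Length, IntPtr.Zero);
    }
}
```

If LooksLikeUtf16 returns true for the first few kilobytes of a file, opening it with Encoding.Unicode as in the question is a reasonable next step. Otherwise fall back to another guess such as UTF-8 or the system code page.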
