C# and HtmlAgilityPack encoding problem

Posted 2020-02-10 01:23

WebClient GodLikeClient = new WebClient();
HtmlAgilityPack.HtmlDocument GodLikeHTML = new HtmlAgilityPack.HtmlDocument();

GodLikeHTML.Load(GodLikeClient.OpenRead("www.alfa.lt"));

So this code returns: "Skaitytojo klausimas psichologui: kas lemia homoseksualumÄ…? - Naujienų portalas Alfa.lt" instead of "Skaitytojo klausimas psichologui: kas lemia homoseksualumą? - Naujienų portalas Alfa.lt".

This web page is encoded in Windows-1257 (Baltic), but textBox1.Text = GodLikeHTML.DocumentNode.OuterHtml; returns distorted text: Baltic diacritics are turned into odd sequences of several characters :(

And yes, I've tried the HtmlAgilityPack forums. They do suck.

P.S. I'm no programmer, but I work on a community project and I really need to get this code working. Thanks ;}

7 answers
何必那么认真
#2 · 2020-02-10 01:59

I had a similar encoding problem. With the most recent version of HtmlAgilityPack, I fixed it by setting OverrideEncoding on the HtmlWeb instance used to load the page:

var htmlWeb = new HtmlWeb();
htmlWeb.OverrideEncoding = Encoding.UTF8;
var doc = htmlWeb.Load("www.alfa.lt");
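
For what it's worth, the garbled "Ä…" in the question is the pattern you get when the UTF-8 bytes of "ą" are decoded as Windows-1252, which may be why forcing UTF-8 helps here. A minimal sketch to reproduce it, assuming .NET Core 3.0 or later (on .NET Framework the provider comes from the System.Text.Encoding.CodePages package); the class and method names are made up for illustration:

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with a different (wrong) encoding.
    public static string DecodeUtf8BytesAs(string text, Encoding wrongEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Assumption: modern .NET, where legacy code pages such as 1252
        // are only available after this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // "\u0105" is the letter a-with-ogonek from the question's page title.
        string garbled = DecodeUtf8BytesAs("homoseksualum\u0105", Encoding.GetEncoding(1252));

        // The string now ends in U+00C4 U+2026, i.e. the "Ä…" seen in the question.
        // How the console renders it depends on the terminal's own encoding.
        Console.WriteLine(garbled);
    }
}
```

If your own garbled output matches this pattern, the page bytes are probably UTF-8 and the override above is the right fix.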
叼着烟拽天下
#3 · 2020-02-10 01:59
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Replace Encoding.Default with the page's actual encoding, e.g. Encoding.GetEncoding(1257).
StreamReader reader = new StreamReader(WebRequest.Create(YourUrl).GetResponse().GetResponseStream(), Encoding.Default);
doc.Load(reader);

Hope it helps :)

欢心
#4 · 2020-02-10 02:07

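This is my solution:

// Needs: using System.IO; using System.Net; using System.Text;
var doc = new HtmlAgilityPack.HtmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
byte[] barr;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (var buffer = new MemoryStream())
{
    // Copy the whole body: ContentLength can be -1, and a single Read call may return fewer bytes than requested.
    response.GetResponseStream().CopyTo(buffer);
    barr = buffer.ToArray();
}
// Decode provisionally as UTF-8 so the charset declared in the markup can be read.
string data = Encoding.UTF8.GetString(barr);
// DetectEncodingHtml may return null when no charset is declared; fall back to UTF-8 in that case.
Encoding encod = doc.DetectEncodingHtml(data) ?? Encoding.UTF8;
// Decode the original bytes again, this time with the detected encoding.
string convstr = encod.GetString(barr);
doc.LoadHtml(convstr);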
仙女界的扛把子
#5 · 2020-02-10 02:07

If none of the other answers work, just use this: WebUtility.HtmlDecode("Your html text");
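
Keep in mind that WebUtility.HtmlDecode only turns HTML entities (such as &#261; or &amp;) into characters; it does not repair text that was decoded with the wrong charset. A small sketch of the difference, with a class name that is just for illustration:

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    public static void Main()
    {
        // A numeric character reference is turned into the character it names (U+0105).
        string fromEntity = WebUtility.HtmlDecode("homoseksualum&#261;?");
        Console.WriteLine(fromEntity == "homoseksualum\u0105?"); // True

        // Text garbled by a wrong charset contains no entities, so it comes back unchanged.
        string garbled = "homoseksualum\u00C4\u2026?";
        Console.WriteLine(WebUtility.HtmlDecode(garbled) == garbled); // True
    }
}
```

So this helps when the diacritics arrive as entities in the markup, but not for the kind of garbling shown in the question.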

乱世女痞
#6 · 2020-02-10 02:12

Try changing that line to: GodLikeHTML.Load(GodLikeClient.OpenRead("www.alfa.lt"), Encoding.GetEncoding(1257));
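
One caveat, assuming you are on .NET Core or .NET 5+ rather than .NET Framework: Encoding.GetEncoding(1257) throws there unless the code-page provider has been registered first. A minimal sketch (class and method names are illustrative) that decodes a single byte instead of hitting the network:

```csharp
using System;
using System.Text;

public static class Baltic1257Demo
{
    // Returns the Windows-1257 (Baltic) encoding, registering the provider that
    // modern .NET needs before legacy code pages can be looked up.
    public static Encoding GetBalticEncoding()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1257);
    }

    public static void Main()
    {
        Encoding baltic = GetBalticEncoding();

        // In Windows-1257 the single byte 0xE0 is the letter a-with-ogonek (U+0105).
        string decoded = baltic.GetString(new byte[] { 0xE0 });
        Console.WriteLine(decoded == "\u0105"); // True
    }
}
```

The Encoding returned by GetBalticEncoding() can then be passed as the second argument to Load in place of Encoding.GetEncoding(1257).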

干净又极端
#7 · 2020-02-10 02:15

UTF-8 didn't work for me, but after setting the encoding like this, most pages I was trying to scrape worked fine:

web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1");

Perhaps it might help someone.
