Parsing HTML to get content using C#

I am writing an application that crawls a group of my web pages. Rather than storing each page's entire source code, I'd like to extract just the content and store it as plain text in a database. The content will be consumed by other applications rather than read by users, so it doesn't need to be perfectly human-readable.

At first I was thinking of using regular expressions, but I have no control over the validity of the web pages, and there is a good chance that no regular expression would reliably give me the content.

If I have the source code within a string, how can I turn that string of source code into just the content in C#?

Tags: c# string html-parsing

4 Answers

兄弟一词,经得起流年.

#2 · 2019-01-06 20:45

The function below removes scripts, style blocks, and all remaining HTML tags from an HTML string and returns what is left as plain text.

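// Requires: using System.Net; using System.Text.RegularExpressions;
private string GetPlainTextFromHtml(string htmlString)
{
    // Remove <script> and <style> elements together with their contents.
    var regexScriptAndStyle = new Regex(@"<script.*?</script\s*>|<style.*?</style\s*>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    htmlString = regexScriptAndStyle.Replace(htmlString, string.Empty);
    // Remove all remaining tags; Singleline lets a tag span several lines.
    htmlString = Regex.Replace(htmlString, "<.*?>", string.Empty, RegexOptions.Singleline);
    // Decode entities such as &nbsp; and &amp; instead of deleting them,
    // so that adjacent words are not glued together.
    htmlString = WebUtility.HtmlDecode(htmlString);
    // Drop lines that contain only whitespace.
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", string.Empty, RegexOptions.Multiline);

    return htmlString;
}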


beautiful°

#3 · 2019-01-06 20:47

Please, please do not parse HTML yourself! A standard regex cannot parse HTML reliably: nested elements, attribute values containing ">", and malformed markup will all defeat it.

There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.

HTML Agility Pack also handles malformed documents, which a regex or a strict parser such as an XML parser will almost never do.
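
To make the regex problem concrete, here is a minimal sketch using only the standard library. The class and method names (RegexStripDemo, StripTags) are made up for illustration. It runs a naive tag-stripping pattern over an anchor whose attribute value contains a ">" character, which is legal HTML:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Naive approach: treat everything from '<' to the next '>' as a tag.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        // The '>' inside the quoted attribute value ends the match too early,
        // so part of the attribute leaks into the "plain text" result.
        string html = "<a title=\"x > y\">link</a>";
        Console.WriteLine("[" + StripTags(html) + "]");
        // prints: [ y">link]
    }
}
```

The only visible text in the input is "link", yet the stripped result still carries a fragment of the title attribute. A real parser tracks quoting and nesting, so it does not have this problem.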


一纸荒年 Trace。

#4 · 2019-01-06 20:52

I wrote code to strip out the raw text from markup and present it in my article Convert HTML to Text. The code presented is pretty simple and lightweight.

I also wrote a lightweight HTML parser and have posted it on Github as HTML Monkey. This would be a more complete solution and it would be a simple task to convert the parsed markup to get only the text. I'm still working on this project and am looking for feedback on how it works.


姐就是有狂的资本

#5 · 2019-01-06 20:54

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

// Requires: using System.Net; using System.Text; using HtmlAgilityPack;
string html;
// obtain some arbitrary HTML...
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the HTML Agility Pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// SelectNodes returns null (not an empty collection) when nothing matches
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();
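
Selecting every text node also picks up the contents of script and style elements and leaves entities such as &amp; undecoded. The following is a sketch of one way to tighten that up, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HtmlTextExtractor, ExtractText) are hypothetical, and collapsing whitespace into single spaces is my own choice for database storage, not something the question specifies:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the visible text of an HTML string.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select text nodes that are not inside <script> or <style>.
        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null)
        {
            return string.Empty;
        }

        var parts = textNodes
            // Decode entities such as &amp; and &nbsp;.
            .Select(node => WebUtility.HtmlDecode(node.InnerText))
            // Collapse runs of whitespace and trim each fragment.
            .Select(text => Regex.Replace(text, @"\s+", " ").Trim())
            // Skip fragments that were whitespace only.
            .Where(text => text.Length > 0);

        return string.Join(" ", parts);
    }
}
```

Calling HtmlTextExtractor.ExtractText(html) on the downloaded page would then give a single-line string suitable for a text column. If line structure matters to the consuming applications, joining with Environment.NewLine instead of a space is an easy change.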
