Regular expression to parse links from HTML code

I'm working on a method that accepts a string (HTML code) and returns an array of all the links contained within it.

I've seen a few options such as the HtmlAgilityPack, but it seems a little more complicated than this project calls for.

I'm also interested in using a regular expression because I don't have much experience with them in general, and I think this would be a good learning opportunity.

My code thus far is

    WebClient client = new WebClient();
    string htmlCode = client.DownloadString(p);
    Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
    string[] test = exp.Split(htmlCode);

but I'm not getting the results I want because I'm still working on the regular expression.

Pseudocode for what I'm looking for is "

Tags: c# html regex parsing hyperlink

4 answers

再贱就再见

#2 · 2019-02-24 06:20

You could look for anything that resembles a URL with an http or https scheme. This is not HTML-proof, but it will get you things that look like HTTP URLs, which I suspect is what you need. You can add more schemes and domains.
The regex looks for things that look like URLs inside href attributes (not strictly).

using System;
using System.Text.RegularExpressions;

class Program {
    static void Main(string[] args) {
        // The named group "url" captures the scheme and host up to the TLD;
        // an optional path after it is matched but not captured.
        const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
        var regex = new Regex(pattern);
        var urls = new string[] { 
            "href='http://company.com'",
            "href=\"https://company.com\"",
            "href='http://company.org'",
            "href='http://company.org/'",
            "href='http://company.org/path'",
        };

        foreach (var url in urls) {
            Match match = regex.Match(url);
            if (match.Success) {
                Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
            }
        }
    }
}

Output:

href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org
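
To run the same idea over a whole page instead of one attribute at a time, Regex.Matches can collect every hit into an array. This is only a minimal sketch: the names LinkExtractor and ExtractLinks are hypothetical, and the pattern is a variation of the one above that uses [^"'] in place of .* so that a match cannot run past the closing quote when several links share a line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Variation of the pattern above (an assumption, not the original):
    // the host and path parts exclude quotes, so each match stays
    // inside its own href attribute.
    private static readonly Regex HrefPattern = new Regex(
        @"href=[""'](?<url>https?://[^/""']+\.(com|org|net|gov))(/[^""']*)?[""']",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the scheme-and-host part of every
    // matching href found in the given HTML string.
    public static string[] ExtractLinks(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToArray();
    }

    public static void Main()
    {
        string html =
            "<a href='http://company.com'>a</a> " +
            "<a href=\"https://company.org/path\">b</a>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Like the original pattern, this keeps only the part of the URL up to the top-level domain; widening the named group to include the path would return the full address instead.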


叼着烟拽天下

#3 · 2019-02-24 06:28

If you are looking for a foolproof solution, regular expressions are not the answer. They are fundamentally limited and cannot reliably parse links, or any other tags for that matter, out of an HTML file because of the complexity of the HTML language.

Long Winded Version: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

Instead you'll need to use an actual HTML DOM API to parse out links.
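
A small illustration of that limitation, offered as a sketch only: a pattern has no notion of document structure, so it reports a link that sits inside an HTML comment and overlooks an unquoted href value that browsers accept. The pattern, the class name NaiveHrefDemo and the example.com URLs are all assumptions made up for this demonstration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // Illustrative pattern (an assumption): matches quoted href values only.
    private static readonly Regex QuotedHref =
        new Regex(@"href=[""'](?<url>[^""']+)[""']", RegexOptions.IgnoreCase);

    // Returns every quoted href value the pattern finds in the input.
    public static string[] Find(string html)
    {
        return QuotedHref.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToArray();
    }

    public static void Main()
    {
        // Commented-out link: not part of the rendered page,
        // yet the regex still reports it.
        string commented = "<!-- <a href='http://old.example.com'>old</a> -->";

        // Unquoted attribute value: accepted by browsers,
        // but the regex does not see it.
        string unquoted = "<a href=http://example.com/page>page</a>";

        Console.WriteLine(Find(commented).Length); // prints 1
        Console.WriteLine(Find(unquoted).Length);  // prints 0
    }
}
```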


可以哭但决不认输i

#4 · 2019-02-24 06:34

Regular expressions are not the best idea for HTML.

see previous questions:

Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.


beautiful°

#5 · 2019-02-24 06:44

Other users may tell you "No, Stop! Regular expressions should not mix with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it's not the full story.

The truth is that regular expressions work just fine for collecting commonly formatted links. However, a better approach would be to use a dedicated tool for this type of thing, such as the HtmlAgilityPack.

If you use regular expressions, you may match 99.9% of the links, but you may miss rare, unanticipated corner cases or malformed HTML.

Here's a function I put together that uses the HtmlAgilityPack to meet your requirements:

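    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private static IEnumerable<string> DocumentLinks(string sourceHtml)
    {
        HtmlDocument sourceDocument = new HtmlDocument();

        sourceDocument.LoadHtml(sourceHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = sourceDocument.DocumentNode
            .SelectNodes("//a[@href!='#']");

        if (anchors == null)
            return Enumerable.Empty<string>();

        return anchors.Select(n => n.GetAttributeValue("href", ""));
    }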

This function creates a new HtmlAgilityPack.HtmlDocument, loads a string containing HTML into it, and then uses the XPath query "//a[@href!='#']" to select all of the links on the page that do not point to "#". The LINQ extension method Select then converts the HtmlNodeCollection into a sequence of strings, each holding the value of an href attribute, that is, the address the link points to.

Here's an example use:

    List<string> links = 
        DocumentLinks((new WebClient())
            .DownloadString("http://google.com")).ToList();

    Debugger.Break();

This should be a lot more robust than regular expressions when the markup is unusual or malformed.

