Get URL from HTML code using a regular expression

Posted 2019-07-23 18:02

Consider:

<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

What regular expression extracts http://anirudhagupta.blogspot.com/ from the HTML above?

A C# solution would be ideal, but a jQuery approach would also be welcome.

5 answers
ゆ 、 Hurt°
#2 · 2019-07-23 18:48

You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.

data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
marker = '<a href="'
for item in data.split("</a>"):
    if marker in item:
        start = item.index(marker) + len(marker)  # first character of the URL
        end = item.index('"', start)              # closing quote of the href value
        print(item[start:end])                    # the URL alone, without the link text

The above is Python 3 code, but you can adapt the idea to C#. Split the HTML string using "</a>" as the delimiter, go through each piece, check whether it contains <a href=", and take the substring that follows it up to the closing quote. Those substrings are your links.

一夜七次
#3 · 2019-07-23 18:49

If you want to use jQuery, the following returns the href of the first <a> element on the page.

$('a').attr('href')
冷血范
#4 · 2019-07-23 18:54

The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
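
To illustrate the parser-plus-XPath idea without C#, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The function name hrefs_via_xpath is made up for this example. It assumes the markup is well-formed XML (as XHTML would be); an XML parser rejects ordinary HTML with unclosed tags.

```python
import xml.etree.ElementTree as ET

def hrefs_via_xpath(markup):
    """Return the href of every descendant <a> element that has one."""
    # fromstring raises ET.ParseError if the markup is not well-formed XML.
    root = ET.fromstring(markup)
    # ".//a[@href]" selects all descendant <a> elements carrying an href attribute.
    return [a.get("href") for a in root.findall(".//a[@href]")]

snippet = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'
print(hrefs_via_xpath(snippet))  # ['http://anirudhagupta.blogspot.com/']
```

Note that ".//a" only searches below the root element, so the sketch expects the links to be wrapped in some outer element, as they are in the question.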

何必那么认真
#5 · 2019-07-23 18:55

The simplest way to do this is using the following regular expression.

/href="([^"]+)"/

This captures every character after the opening quote up to, but not including, the next quote. In most languages this is the fastest way to match a quoted string that can't itself contain quotes. Quotes inside attribute values should be encoded (for example as &quot;), so the closing quote is the first one the pattern meets.
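
As a rough illustration, the same pattern can be tried in Python with re.findall, which returns the contents of the capture group for every match. The two-link input is borrowed from the sample data earlier in this thread; the variable names are just for this sketch.

```python
import re

# The pattern from the answer, without the surrounding slashes.
HREF_DOUBLE_QUOTED = re.compile(r'href="([^"]+)"')

html = ('<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>\n'
        '<div><a href="http://mike.blogspot.com/">Mike\'s Web blog</a></div>')

# findall returns only the captured group, i.e. the text between the quotes.
urls = HREF_DOUBLE_QUOTED.findall(html)
print(urls)  # ['http://anirudhagupta.blogspot.com/', 'http://mike.blogspot.com/']
```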

UPDATE: A complete Perl program for parsing URLs would look like this:

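use 5.010;
use strict;
use warnings;

my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;              # double-quoted values
    push @matches, m/href='([^']+)'/gi;              # single-quoted values
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;   # unquoted values
}
say for @matches;   # print once, after all input is read, so nothing repeats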

It reads from stdin and prints all URLs. It handles the three possible quoting styles: double-quoted, single-quoted, and unquoted attribute values. Use it with curl to find all the URLs in a webpage:

curl url | perl urls.pl
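
For readers who would rather not use Perl, the three quoting styles can also be folded into a single alternation. This is only a sketch: the name extract_hrefs and the example.* addresses are invented for illustration, and unlike the Perl program it returns the URLs in document order instead of grouped by quoting style.

```python
import re

# One alternative per quoting style; only one of the three groups
# is non-empty for any given match.
HREF_ANY = re.compile(
    r'''href=(?:"([^"]+)"|'([^']+)'|([^"'\s>][^\s>]*))''',
    re.IGNORECASE,
)

def extract_hrefs(html):
    """Return href values, whether double-quoted, single-quoted, or unquoted."""
    return [dq or sq or bare for dq, sq, bare in HREF_ANY.findall(html)]

sample = """<a href="http://a.example/">A</a>
<a href='http://b.example/'>B</a>
<a href=http://c.example/>C</a>"""
print(extract_hrefs(sample))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```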
干净又极端
#6 · 2019-07-23 19:03

Quick and dirty:

href="(.*?)"

OK, here is another regex, this one for splitting a URL into its components once you have extracted it. It comes from RFC 2396 (URI Generic Syntax), Appendix B, "Parsing a URI Reference with a Regular Expression":

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
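
In that appendix the numbered groups map to components as follows: group 2 is the scheme, 4 the authority, 5 the path, 7 the query, and 9 the fragment. A small Python sketch of that mapping (the helper name split_uri is made up here) could look like this:

```python
import re

# Regular expression from RFC 2396, Appendix B, copied unchanged.
URI_REFERENCE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Split a URI reference into its five components (None where absent)."""
    # Every part of the pattern is optional, so match() always succeeds.
    m = URI_REFERENCE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://anirudhagupta.blogspot.com/")
print(parts["scheme"], parts["authority"], parts["path"])
# http anirudhagupta.blogspot.com /
```

This regex does not locate URLs inside HTML; it only takes apart a string that is already known to be a URI reference.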

Of course, your HTML may also contain relative URLs, which have to be resolved against a base address. For that I recommend the C# Uri(Uri, String) constructor.
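
As a point of comparison outside C#, Python's standard library offers urllib.parse.urljoin for the same job of resolving a relative reference against a base address. The page path and relative links below are hypothetical, chosen only to show the behaviour.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the links were extracted from.
base = "http://anirudhagupta.blogspot.com/2009/01/post.html"

resolved = [
    urljoin(base, "../about.html"),         # relative to the page's directory
    urljoin(base, "/feeds"),                # relative to the site root
    urljoin(base, "http://example.org/x"),  # already absolute: left unchanged
]
for url in resolved:
    print(url)
# http://anirudhagupta.blogspot.com/2009/about.html
# http://anirudhagupta.blogspot.com/feeds
# http://example.org/x
```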
