Get URL from HTML code using a regular expression

Consider:

<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

What regular expression would extract http://anirudhagupta.blogspot.com/ from the markup above?

A C# solution would be ideal, but a jQuery approach is also welcome.

Tags: c# asp.net regex asp.net-mvc url

5 answers

ゆ、 Hurt°

#2 · 2019-07-23 18:48

You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.

data="""
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""    
marker = '<a href="'
for item in data.split("</a>"):
    if marker in item:
        start = item.index(marker) + len(marker)  # index of the first character of the URL
        end = item.index('"', start)              # index of the closing quote
        print(item[start:end])                    # the URL alone, without the link text

The above is Python code, but you can adapt the idea to C#. Split the HTML string using "</a>" as the delimiter, go through each piece, check whether it contains <a href=", and take the substring between that marker and the next quote. Those substrings are your links.


一夜七次

#3 · 2019-07-23 18:49

If you want to use jQuery, you can do the following. Note that it returns the href of the first <a> element on the page only.

$('a').attr('href')


冷血范

#4 · 2019-07-23 18:54

The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. That way you don't have to write any parsing code yourself. Keep in mind that an XML parser only accepts well-formed markup.
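As a rough illustration of the parser-plus-XPath idea, here is a minimal Python sketch (Python rather than C#, to match the earlier Python answer). It assumes the markup is well-formed XML, which real-world HTML often is not, and the function name extract_hrefs is made up for this example.

```python
import xml.etree.ElementTree as ET


def extract_hrefs(markup):
    """Return the href of every <a> descendant (hypothetical helper).

    Assumes `markup` is well-formed XML; ET.fromstring raises
    xml.etree.ElementTree.ParseError otherwise.
    """
    root = ET.fromstring(markup)
    # ".//a[@href]" is an XPath-style query: every <a> below the root
    # element that carries an href attribute.
    return [a.get("href") for a in root.findall(".//a[@href]")]


snippet = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'
print(extract_hrefs(snippet))
```

For markup that is not well-formed, a tolerant HTML parser would be needed instead of a strict XML one.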


何必那么认真

#5 · 2019-07-23 18:55

The simplest way to do this is using the following regular expression.

/href="([^"]+)"/

This captures every character after the opening quote up to, but not including, the next quote. In most languages this is the fastest way to match a quoted string that cannot itself contain quotes, and quotes inside attribute values should be encoded anyway.
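To show the pattern in use, here is a small Python sketch. The variable names are invented for the example, and the case-insensitive flag is an addition so that HREF= is matched too.

```python
import re

# The pattern from this answer: capture everything between href=" and the next quote.
# re.IGNORECASE is an assumption added here so that HREF="..." also matches.
HREF_PATTERN = re.compile(r'href="([^"]+)"', re.IGNORECASE)

html = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

# findall returns only the captured group, i.e. the URL itself.
urls = HREF_PATTERN.findall(html)
print(urls)
```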

UPDATE: A complete Perl program for extracting URLs would look like this:

use 5.010;

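my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;            # double-quoted values
    push @matches, m/href='([^']+)'/gi;            # single-quoted values
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi; # unquoted values
}
say for @matches;  # print once, after all input has been read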

It reads from stdin and prints all URLs. It handles the three quoting styles: double-quoted, single-quoted, and unquoted attribute values. Use it with curl to find all the URLs in a web page:

curl url | perl urls.pl
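For readers who prefer Python, the same three matches might be sketched as follows. The helper name find_hrefs and the sample input are assumptions made for this illustration.

```python
import re

# One pattern per quoting style, mirroring the three Perl matches.
PATTERNS = [
    re.compile(r'''href="([^"]+)"''', re.IGNORECASE),             # double-quoted
    re.compile(r'''href='([^']+)\'''', re.IGNORECASE),            # single-quoted
    re.compile(r'''href=([^"'][^>\s]*)[>\s]+''', re.IGNORECASE),  # unquoted
]


def find_hrefs(text):
    """Collect href values for all three quoting styles (hypothetical helper)."""
    matches = []
    for pattern in PATTERNS:
        matches.extend(pattern.findall(text))
    return matches


sample = '''<a href="a.html">x</a> <a href='b.html'>y</a> <a href=c.html>z</a>'''
print(find_hrefs(sample))
```

The results are grouped by quoting style rather than by position in the input, because each pattern scans the whole text in turn.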


干净又极端

#6 · 2019-07-23 19:03

Quick and dirty:

href="(.*?)"

Here is another regex, this one for splitting a URL into its components. It comes from RFC 2396 (URI Generic Syntax), "Parsing a URI Reference with a Regular Expression":

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

Of course, your HTML may contain relative URLs, and those need to be resolved against a base address. For that I recommend the C# Uri constructor Uri(Uri, String).
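A short Python sketch of both steps follows: the RFC regex applied to the URL from the question, then a relative link resolved against a base address. It uses urljoin from Python's standard library in the role this answer gives to the C# Uri constructor, and the relative path "archive/2009.html" is an invented example.

```python
import re
from urllib.parse import urljoin

# The RFC 2396 regex quoted in this answer, unchanged.
URI_PATTERN = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

match = URI_PATTERN.match("http://anirudhagupta.blogspot.com/")
# Group 2 is the scheme, group 4 the authority, group 5 the path.
scheme, authority, path = match.group(2), match.group(4), match.group(5)
print(scheme, authority, path)

# Resolve a relative link against a base address.
base = "http://anirudhagupta.blogspot.com/"
resolved = urljoin(base, "archive/2009.html")  # hypothetical relative link
print(resolved)
```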
