Consider:
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
What regular expression would extract http://anirudhagupta.blogspot.com/ from the markup above?
A suggestion in C# would be good. I would also like a way to do this with jQuery.
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
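The original snippet did not survive here, so what follows is only a minimal sketch of the split-based idea, not the answerer's exact code. The function name `extract_links` is made up for illustration, and the sketch assumes quoted `href` values.

```python
def extract_links(html):
    """Collect quoted href values by splitting the markup on "</a>"."""
    links = []
    for chunk in html.split("</a>"):
        pos = chunk.find("href")
        if pos == -1:
            continue  # this field contains no link
        # Everything after "href", e.g. '="http://example.com/">text'
        rest = chunk[pos + len("href"):].lstrip().lstrip("=").lstrip()
        if not rest or rest[0] not in "\"'":
            continue  # unquoted values are not handled in this sketch
        quote = rest[0]
        end = rest.find(quote, 1)  # position of the closing quote
        if end != -1:
            links.append(rest[1:end])
    return links
```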
The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each resulting field, check for "href", then take the substring after "href". Those will be your links.
If you want to use jQuery, you can do the following.
The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. That way you don't have to hand-write any parsing, although the markup must be well-formed XML for this to work.
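The answer is about C#, but the same idea can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. This assumes well-formed markup; the function name `extract_hrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_hrefs(markup):
    """Parse well-formed markup and return the href of every <a> below the root."""
    root = ET.fromstring(markup)  # raises ParseError on malformed input
    # ".//a[@href]" selects all descendant <a> elements that carry an href attribute.
    return [a.get("href") for a in root.iterfind(".//a[@href]")]
```

Real-world HTML is often not valid XML, in which case `ET.fromstring` raises `ParseError` and a more forgiving HTML parser would be needed.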
The simplest way to do this is using the following regular expression.
This captures every character from the first quote up to the next quote. In most languages this is the fastest way to get a quoted string that can't itself contain quotes. Quotes should be encoded when used in attributes.
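The pattern itself is not shown above. One pattern that fits the description, offered as an assumption and not as the answerer's exact regex, is `href="([^"]*)"`. A short Python sketch using it (the names `HREF_PATTERN` and `find_hrefs` are illustrative):

```python
import re

# Assumed pattern: capture everything between the quote after href= and the next quote.
HREF_PATTERN = re.compile(r'href="([^"]*)"')


def find_hrefs(html):
    """Return every double-quoted href value found in the string."""
    return HREF_PATTERN.findall(html)
```

The negated character class `[^"]*` stops at the first closing quote, so no backtracking is needed.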
UPDATE: A complete Perl program for parsing URLs would look like this:
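The Perl source is not reproduced here. As a stand-in, this is a rough Python sketch of a program with the behaviour described. The pattern and the names `HREF_RE`, `extract_urls` and `print_urls` are assumptions for illustration, not a translation of the original, and "three possible quotes" is read as double-quoted, single-quoted and unquoted values.

```python
import re
import sys

# Assumed pattern: href followed by a double-quoted, single-quoted, or unquoted value.
HREF_RE = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))""",
    re.IGNORECASE,
)


def extract_urls(html):
    """Return all href values, whichever of the three quoting styles was used."""
    urls = []
    for match in HREF_RE.finditer(html):
        # Exactly one of the three groups participates in each match.
        urls.append(next(g for g in match.groups() if g is not None))
    return urls


def print_urls(stream=None, out=None):
    """Read markup from stream (stdin by default) and print one URL per line."""
    stream = sys.stdin if stream is None else stream
    for url in extract_urls(stream.read()):
        print(url, file=out)
```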
It reads from stdin and prints all URLs. It takes care of the three possible quotes. Use it with curl to find all the URLs in a webpage:
Quick and dirty:
OK, let's go with another regex for parsing URLs. This one comes from RFC 2396 (URI Generic Syntax), Appendix B: Parsing a URI Reference with a Regular Expression.
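For reference, the expression given in Appendix B of that RFC can be tried out with a small Python sketch. The group numbers follow the RFC; the names `URI_RE` and `split_uri` are illustrative. Note that this regex splits an already-isolated URI into components; it does not locate URLs inside HTML.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Break a URI reference into the five components named in the RFC."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

Components that are absent from the input come back as `None`.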
Of course, your HTML may contain relative URLs, and you'll need to handle those separately. For that I can recommend the C# Uri Constructor (Uri, String).
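The recommendation above is for C#. As a rough Python analogue, `urllib.parse.urljoin` resolves a relative reference against a base URL in a similar way. The helper name `resolve_links` and the example paths are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Turn each href into an absolute URL, resolving relative ones against base_url."""
    return [urljoin(base_url, href) for href in hrefs]
```

Absolute hrefs pass through unchanged, so the same call can be applied to every extracted link without first checking whether it is relative.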