Consider:
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
What is the regular expression to get http://anirudhagupta.blogspot.com/
from the markup above?
A suggestion in C# would be good. I would also like a way to do this with jQuery.
Quick and dirty:
href="(.*?)"
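As a quick check of the pattern above, here is a minimal Python sketch. The variable names are illustrative and not from the question. It applies the lazy pattern to the sample markup and prints the first captured group.

```python
import re

# Sample markup from the question.
html = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

# Lazy match: capture everything between href=" and the next double quote.
match = re.search(r'href="(.*?)"', html)
url = match.group(1) if match else None
print(url)
```

Note that the lazy quantifier also accepts an empty value such as href="", which may or may not be what you want.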
OK, here is another regex, this one for parsing URLs. It comes from RFC 2396 (URI Generic Syntax), Appendix B: Parsing a URI Reference with a Regular Expression:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
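Keep in mind that this pattern splits an already extracted URI into its components; it does not locate URLs inside HTML. A small Python sketch of how the groups might be read follows. The function name split_uri and the dictionary keys are my own, while the group numbers (2, 4, 5, 7, 9) follow the RFC appendix.

```python
import re

# Pattern from RFC 2396, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Split a URI reference into its five components (None if absent)."""
    # Every part of the pattern is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://anirudhagupta.blogspot.com/")
print(parts)
```

For a relative reference such as "../a/b?x=1#top", the scheme and authority come back as None, which is one way to detect that the link still needs a base URL.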
Of course, your HTML may contain relative URLs, and you need to handle those differently. For that I recommend the C# Uri constructor Uri(Uri, String), which resolves a relative URI against a base URI.
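To illustrate what resolving a relative link against a base means, here is a sketch using Python's urllib.parse.urljoin, which plays a similar role to the C# constructor. The base address and the href values are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page address used as the base for relative links.
base = "http://anirudhagupta.blogspot.com/2009/01/index.html"

# A document-relative link, a root-relative link, and an absolute link.
hrefs = ["about.html", "/feeds/posts", "http://example.com/x"]

absolute = [urljoin(base, h) for h in hrefs]
for url in absolute:
    print(url)
```

Absolute links pass through unchanged, so it should be safe to run every extracted href through the same call.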
If you want to use jQuery you can do the following.
$('a').attr('href')
The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
This captures every character after the opening quote up to, but not including, the next quote. In most languages this is the fastest way to match a quoted string that cannot itself contain quotes. That assumption holds here because quotes inside attribute values should be encoded (for example as &quot;).
UPDATE: A complete Perl program for parsing URLs would look like this:
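use 5.010;
use strict;
use warnings;

my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;             # double-quoted values
    push @matches, m/href='([^']+)'/gi;             # single-quoted values
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;  # unquoted values
}
say for @matches;    # print once, after all input has been read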
It reads from stdin and prints all URLs. It handles the three quoting styles (double-quoted, single-quoted, and unquoted). Use it with curl to find all the URLs in a web page:
curl url | perl urls.pl
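For readers who do not use Perl, the same three patterns can be tried in Python. This is only a rough port under my own naming (PATTERNS, find_hrefs, and the sample markup are illustrative), and it returns results grouped by quoting style rather than in document order.

```python
import re

# One pattern per quoting style: double-quoted, single-quoted, unquoted.
PATTERNS = [
    re.compile(r'''href="([^"]+)"''', re.IGNORECASE),
    re.compile(r"""href='([^']+)'""", re.IGNORECASE),
    re.compile(r'''href=([^"'][^>\s]*)[>\s]+''', re.IGNORECASE),
]

def find_hrefs(text):
    """Return href values found in text, grouped by quoting style."""
    found = []
    for pattern in PATTERNS:
        found.extend(pattern.findall(text))
    return found

# Illustrative input with one link per quoting style.
sample = '''<a href="http://a.example/">A</a>
<a HREF='http://b.example/'>B</a>
<a href=http://c.example/>C</a>'''

links = find_hrefs(sample)
print(links)
```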
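The right way to do this is to load the HTML into a C# XML parser (for example XmlDocument) and then use XPath to query the href attributes. This only works if the markup is well-formed XML, such as XHTML, but in return you don't have to write any parsing code yourself.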
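The parser-plus-XPath idea can be sketched in Python with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is an illustration of the approach rather than the C# code itself, and it relies on the sample fragment being well-formed XML.

```python
import xml.etree.ElementTree as ET

# Works only because this fragment is well-formed XML; ET.fromstring
# raises ParseError on markup with unclosed tags or unquoted attributes.
markup = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

root = ET.fromstring(markup)

# Select every <a> element that has an href attribute, at any depth.
urls = [a.get("href") for a in root.findall(".//a[@href]")]
print(urls)
```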
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
data="""
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
marker = '<a href="'
for item in data.split("</a>"):
    if marker in item:
        start = item.index(marker) + len(marker)  # first character of the URL
        end = item.index('"', start)              # closing quote of the href value
        print(item[start:end])                    # print only the URL
The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each piece, check for "<a href", then take the quoted value that follows it. Those will be your links.