There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (href, inner value, etc.). So NONE of the URLs in these should match:
<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to check whether the first <a> tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think the idea was okay, but my negative lookahead regex didn't work (or, more accurately, I didn't write it correctly). Any tips would be much appreciated.
You can do it in two steps instead of trying to come up with a single regular expression:
1. Strip out (replace with nothing) the entire HTML anchor element: opening tag, content, and closing tag.
2. Match the URL in what remains.
In Perl it could be:
my $curLine = $_; # Work on a copy so that $_ stays intact if it is needed for something else.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis; # Remove every HTML anchor: "<a ...>", "</a>" and everything in between, nested tags included.
if ( $curLine =~ /http:\/\// )
{
    print "Matched a URL outside an HTML anchor!: $_\n";
}
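You can do that using a single regular expression that matches both whole anchor elements and bare URLs:
# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>.*?</a>)|(http://[^ <]+)'
Then loop over the results and only process matches where the second sub-pattern was found. Because the first alternative consumes each anchor from its opening tag through its closing tag, a URL inside an anchor never reaches the second sub-pattern.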
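A minimal Python sketch of that loop might look like the following. It assumes the first alternative matches the whole anchor element (opening tag through `</a>`), so that nothing inside an anchor is left over for the URL alternative; the URL pattern is still a placeholder and `urls_outside_anchors` is a made-up name.

```python
import re

# Group 1 swallows a whole anchor element; group 2 is a placeholder URL pattern.
PATTERN = re.compile(
    r"""(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"']+)""",
    re.IGNORECASE | re.DOTALL,
)

def urls_outside_anchors(text):
    """Collect only the matches where the second sub-pattern took part."""
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None]
```

Matches where group 1 is set are simply skipped, which is what keeps URLs in `href` attributes and anchor text out of the result.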
Peter has a great answer: first, remove anchors so that
Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net
is replaced by
Some text and some more text with link http://a.net
Then run a regex that finds URLs in the remaining text:
http://a.net
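The same two steps, sketched in Python on the example line above. Both patterns are simplified assumptions (in particular the URL pattern is a placeholder), and `urls_outside_anchors` is a name chosen only for this illustration.

```python
import re

# Step 1 pattern: a whole anchor element, from "<a ...>" through "</a>".
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Step 2 pattern: a deliberately simple URL matcher.
URL = re.compile(r"""https?://[^\s<>"']+""")

def urls_outside_anchors(text):
    stripped = ANCHOR.sub("", text)   # step 1: remove the anchors
    return URL.findall(stripped)      # step 2: find URLs in what is left

line = 'Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net'
print(urls_outside_anchors(line))  # ['http://a.net']
```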
Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
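As a rough illustration of the parser-based route, here is a sketch using Python's standard `html.parser` module. It is an event-based parser rather than a full DOM, so treat it as a stand-in for whatever DOM library you have available. The class and function names are invented for this example and the URL regex is again a placeholder.

```python
import re
from html.parser import HTMLParser

URL = re.compile(r"""https?://[^\s<>"']+""")  # placeholder URL pattern

class OutsideAnchorText(HTMLParser):
    """Collects text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <a> elements are currently open
        self.chunks = []    # text seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def urls_outside_anchors(html):
    parser = OutsideAnchorText()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return URL.findall(" ".join(parser.chunks))
```

Because only text nodes are collected, URLs in attributes (an `href`, or an `src` on another tag) are never examined, which may or may not be what you want.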