There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (href, inner value, etc.). So NONE of the URLs in these should match:
<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to see whether the first anchor tag after the URL was an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think this idea was okay, but the negative lookahead regex didn't work (or more accurately, the regex wasn't written correctly). Any tips are much appreciated.
You can do it in two steps instead of trying to come up with a single regular expression:
1. Strip out (replace with nothing) each HTML anchor element in full: opening tag, content, and closing tag.
2. Match the URLs in what remains.
In Perl it could be:
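The same two steps can also be sketched in Python. This is a minimal illustration, not a robust HTML handler: the names ANCHOR_RE, URL_RE, and urls_outside_anchors are made up for this example, and the URL pattern is a deliberately simple assumption (http or https, up to the next whitespace, quote, or angle bracket). It also assumes anchors are not nested.

```python
import re

# Assumed pattern: a whole anchor element, from "<a ...>" to the nearest "</a>".
# The \b keeps it from matching other tags that start with "a", such as <abbr>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)

# Assumed, deliberately simple URL pattern; swap in a stricter one if needed.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def urls_outside_anchors(html):
    # Step 1: strip every anchor element. A space is used instead of the
    # empty string so text on either side of an anchor cannot fuse into one URL.
    stripped = ANCHOR_RE.sub(" ", html)
    # Step 2: match URLs in whatever is left.
    return URL_RE.findall(stripped)


sample = (
    'See http://outside.example/page and '
    '<a href="http://www.example.com/"><b>something</b>'
    'http://www.example.com/<span>test</span></a> '
    'plus https://other.example/x end'
)
print(urls_outside_anchors(sample))
```

Note that the simple URL pattern will also swallow trailing punctuation such as a full stop directly after a URL, so a real pattern may need to trim that.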
You can do that using a single regular expression that matches both anchor elements and URLs:
Then loop over the results and only process the matches where the second sub-pattern (the URL) was found.
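As a rough sketch of that idea in Python: the alternation below tries a whole anchor element first and a bare URL second, so any URL inside an anchor is consumed as part of group 1 and never reaches group 2. The names COMBINED_RE and urls_via_alternation and the URL pattern itself are assumptions made for illustration.

```python
import re

# Group 1: a complete anchor element. Group 2: a URL (assumed simple pattern).
# Matching scans left to right, so an anchor is consumed whole before any
# URL inside it can be seen by the second alternative.
COMBINED_RE = re.compile(
    r"(<a\b[^>]*>.*?</a>)|(https?://[^\s<>\"']+)",
    re.IGNORECASE | re.DOTALL,
)


def urls_via_alternation(html):
    found = []
    for match in COMBINED_RE.finditer(html):
        # Only keep matches where the second sub-pattern took part.
        if match.group(2) is not None:
            found.append(match.group(2))
    return found


text = (
    'http://first.example/a '
    '<a href="http://www.example.com/">http://www.example2.com</a> '
    'http://second.example/b'
)
print(urls_via_alternation(text))
```

Compared with stripping anchors first, this keeps the original string intact, which helps if the match positions are needed later (for example, to wrap the bare URLs in links).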
Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
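A possible Python sketch of the parser-based route follows. Python's standard library does not ship a forgiving HTML DOM, so this uses the event-based html.parser.HTMLParser as a stand-in: it keeps only the text that sits outside any <a> element and then runs a simple URL regex over it. The class and function names, and the URL pattern, are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed, deliberately simple URL pattern.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class TextOutsideAnchors(HTMLParser):
    """Collects text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while inside an <a> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.anchor_depth == 0:
            self.chunks.append(data)


def urls_outside_anchors_dom(html):
    parser = TextOutsideAnchors()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return URL_RE.findall(" ".join(parser.chunks))


page = (
    '<p>Visit http://outside.example/page or '
    '<a href="http://www.example.com/">http://www.example2.com</a></p>'
)
print(urls_outside_anchors_dom(page))
```

Because only text nodes are inspected, URLs in attributes of other tags (an img src, say) are ignored as well; whether that is desirable depends on the use case.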
Peter has a great answer: first, remove anchors so that
is replaced by
Then run a regexp that finds URLs: