Regular expression to find URLs not inside a hyper

Posted 2019-05-12 09:28

Question:

There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (href attribute, inner content, etc.). So NONE of the URLs in these should match:

<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>

Any URL outside of <a></a> should be matched.

One approach I tried was a negative lookahead that checks whether the first anchor tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, the URL must be inside a hyperlink. I think the idea was okay, but my negative lookahead didn't work (more accurately, I didn't write the regex correctly). Any tips are much appreciated.

Answer 1:

You can do it in two steps instead of trying to come up with a single regular expression:

  1. Strip out (replace with nothing) every HTML anchor element: the opening tag, the content and the closing tag.

  2. Match the URL in what remains.

In Perl it could be:

my $curLine = $_; # Work on a copy so that $_ stays intact if it is needed elsewhere.
# s///gis: substitute globally, case-insensitively, with "." also matching newlines.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis; # Remove each HTML anchor: "<a ...>", "</a>" and everything in between.
if ( $curLine =~ /https?:\/\// )
{
  print "Matched a URL outside an HTML anchor: $_\n";
}


Answer 2:

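You can do that using a single regular expression that matches both whole anchor elements and bare URLs:

# Note that this is a dummy, you'll need a more sophisticated URL regex.
# Use it case-insensitively and with "." matching newlines.
regex = '(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"]+)'

Then loop over the results and only process matches where the second sub-pattern was found. The first sub-pattern consumes each anchor from its opening tag to its closing tag, so a URL inside an anchor never reaches the second sub-pattern.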
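
The loop described above might look like the following in Python. This is only a sketch: the function name urls_outside_anchors and the simplified URL sub-pattern are assumptions made for illustration, not part of the original answer.

```python
import re

# Alternative 1 consumes a whole anchor element (opening tag through </a>).
# Alternative 2 captures a bare URL. The URL part is deliberately simplistic
# and is an assumption of this sketch, not a complete URL grammar.
PATTERN = re.compile(
    r'(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"]+)',
    re.IGNORECASE | re.DOTALL,
)


def urls_outside_anchors(html):
    """Return only the URLs matched by the second sub-pattern."""
    return [m.group(2) for m in PATTERN.finditer(html) if m.group(2)]


if __name__ == "__main__":
    sample = ('Some text <a href="http://page.net">TeXt</a> '
              'and some more text with link http://a.net')
    print(urls_outside_anchors(sample))
```

Because the regex engine scans left to right, an anchor is always matched as a whole before any URL inside it can be tried, which is why filtering on the second group is enough.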



Answer 3:

Peter has a great answer: first, remove anchors so that

Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net

is replaced by

Some text  and some more text with link http://a.net

Then run a regex that finds URLs, which yields:

http://a.net
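
The same two steps can be sketched in Python. The helper names strip_anchors and find_bare_urls and the simplified URL pattern are assumptions for illustration. A regex-based strip like this only copes with simple, non-nested anchors.

```python
import re

# Step 1 pattern: a whole anchor element, from "<a ...>" to the next "</a>".
ANCHOR = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
# Step 2 pattern: a deliberately simple URL matcher (an assumption of this sketch).
URL = re.compile(r'https?://[^\s<>"]+')


def strip_anchors(text):
    """Replace every anchor element with nothing."""
    return ANCHOR.sub('', text)


def find_bare_urls(text):
    """Find URLs in the text that remains after the anchors are removed."""
    return URL.findall(strip_anchors(text))


if __name__ == "__main__":
    sample = ('Some text <a href="http://page.net">TeXt</a> '
              'and some more text with link http://a.net')
    print(strip_anchors(sample))
    print(find_bare_urls(sample))
```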


Answer 4:

Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
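
As a rough illustration of this idea, the sketch below uses Python's standard html.parser module. Note the assumptions: html.parser is an event-based parser standing in for a real DOM library, the class and function names are made up for this example, and the URL pattern is deliberately simple. It only looks at text content, so URLs in attributes of other tags (for example an img src) are ignored.

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL matcher (an assumption of this sketch).
URL = re.compile(r'https?://[^\s<>"]+')


class TextOutsideAnchors(HTMLParser):
    """Collect text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> elements are currently open
        self.chunks = []  # text found while no <a> is open

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def urls_outside_anchors_dom(html):
    parser = TextOutsideAnchors()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Join with spaces so text from separate nodes cannot fuse into one URL.
    return URL.findall(' '.join(parser.chunks))


if __name__ == "__main__":
    sample = ('<p>See http://a.net or <a href="http://page.net">'
              'http://b.net <b>x</b></a> and http://c.net</p>')
    print(urls_outside_anchors_dom(sample))
```

Compared with the regex-only answers, a parser tolerates nested tags and unusual attribute quoting inside the anchor, at the cost of a little more code.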



Tags: html regex url