Regular expression to find URLs not inside a hyper

There are many regexes out there that match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (href attribute, inner content, etc.). So NONE of the URLs in these should match:

<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>

Any URL outside of <a></a> should be matched.

One approach I tried was to use a negative lookahead to check whether the first anchor tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think the idea was okay, but the negative lookahead didn't work (or, more accurately, I didn't write the regex correctly). Any tips are much appreciated.

Tags: html regex url

4 answers

趁早两清

Reply #2 · 2019-05-12 09:25

You can do it in two steps instead of trying to come up with a single regular expression:

1. Remove the HTML anchor elements (replace each entire element with nothing: opening tag, content, and closing tag).
2. Match URLs in what remains.

In Perl it could be:

my $curLine = $_; # Work on a copy so $_ stays intact if it is needed elsewhere.
# Remove every HTML anchor element: "<a ...>", "</a>" and everything in between,
# including nested tags such as <b> or <span>.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis;
if ( $curLine =~ /http:\/\// )
{
  print "Matched a URL outside an HTML anchor: $_\n";
}
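
For comparison, the same two steps could be sketched in Python roughly like this. The names `ANCHOR_RE`, `URL_RE`, and `urls_outside_anchors` are made up for the example, the URL pattern is deliberately simplistic, and the sketch assumes anchors are not nested and are always closed:

```python
import re

# Assumption: anchors are not nested and every <a ...> has a closing </a>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def urls_outside_anchors(html):
    """Return the URLs left over once every <a>...</a> element is removed."""
    # Step 1: drop each anchor element. A space is substituted so that the
    # text on either side of the anchor is not glued together.
    stripped = ANCHOR_RE.sub(" ", html)
    # Step 2: match URLs in what remains.
    return URL_RE.findall(stripped)

sample = (
    'Some text <a href="http://page.net">TeXt</a> '
    "and some more text with link http://a.net"
)
print(urls_outside_anchors(sample))
```

Substituting a space rather than an empty string is a small deviation from the Perl above; it only matters when a URL sits directly next to an anchor.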

放我归山

Reply #3 · 2019-05-12 09:26

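You can do that using a single regular expression that matches both anchor elements and URLs:

# Note that this is a dummy, you'll need a more sophisticated URL regex.
# The first group must consume the whole anchor element, not just the opening
# tag; otherwise URLs in the link text are still caught by the second group.
regex = r'(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"]+)'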

Then loop over the results and only process matches where the second sub-pattern was found.
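
A minimal Python sketch of that loop might look as follows. It assumes the first alternative swallows the whole anchor element (opening tag through `</a>`), so URLs in the link text never reach the second group; `COMBINED_RE` and `bare_urls` are names invented for the example, and the URL part is still a placeholder pattern:

```python
import re

COMBINED_RE = re.compile(
    r"(<a\b[^>]*>.*?</a>)"       # group 1: a whole anchor element, to be skipped
    r"|(https?://[^\s<>\"']+)",  # group 2: a URL outside any anchor, to be kept
    re.IGNORECASE | re.DOTALL,
)

def bare_urls(html):
    """Return only the matches where the second sub-pattern was found."""
    return [m.group(2) for m in COMBINED_RE.finditer(html) if m.group(2)]

text = 'See http://a.net and <a href="http://b.net">http://c.net</a> then https://d.org/x'
print(bare_urls(text))
```

Because the regex engine scans left to right, an anchor is always consumed as one match before the URL alternative gets a chance to look inside it.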

Lonely孤独者°

Reply #4 · 2019-05-12 09:27

Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
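
As a rough illustration in Python: the standard library's `html.parser` is an event-based parser rather than a true DOM, but it is enough to show the idea of skipping everything inside `<a>` and running a URL regex on the remaining text. The class name `OutsideAnchorText`, the function name, and the simple URL pattern are assumptions made for this sketch:

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

class OutsideAnchorText(HTMLParser):
    """Collect text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <a> elements currently open
        self.chunks = []  # text seen while no <a> is open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def urls_outside_anchors(html):
    parser = OutsideAnchorText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Attribute values (such as href) never arrive via handle_data,
    # so only visible text outside anchors is searched.
    return URL_RE.findall(" ".join(parser.chunks))

page = '<p>Visit http://a.net or <a href="http://b.net"><b>bold</b> http://c.net</a>.</p> https://d.org'
print(urls_outside_anchors(page))
```

A real DOM library would work the same way in spirit: drop the anchor nodes, then search the text that is left.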

来，给爷笑一个

Reply #5 · 2019-05-12 09:45

Peter has a great answer: first, remove anchors so that

Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net

is replaced by

Some text  and some more text with link http://a.net

Then run a regex that finds URLs:

http://a.net
