Regex non-capturing group is capturing

I have this regex:

(?:\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)>

The point of this regex is to capture every closing tag ('>') of an anchor that has an href that starts with "http://" or ends with ".pdf".

The regex works, however it is capturing the first part of the anchor, which I absolutely need to NOT capture.

In the following samples, all match except the second (which is fine), but only the last bracket should be captured, and that is not the case.

<a href="http://blabla">omg</a>
<a href="blabla">omg</a>
<a href="http://blabla.pdf">omg</a>
<a href="/blabla.pdf">omg</a>

For example, take the first match:

<a href="http://blabla">

I only want to capture the last bracket (the one I have surrounded with parentheses):

<a href="http://blabla"(>)

So why is the non-capturing group capturing? And how can I grab only the last bracket of the anchor?

Even if I streamline my regex to the following, it still doesn't work:

(?:\<a[^*]href="http://[^"]+"+[^>]*)(>)

Thank you,

Tags: html regex anchor

5 answers

在下西门庆

#2 · 2020-04-21 08:58

If I'm understanding correctly that you want to match just the greater-than sign (>) that's part of the closing anchor tag, this should do it:

\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)
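
Note that this pattern keeps the capturing parentheses around the href alternatives, so the `>` lands in group 2 rather than group 1. A minimal sketch in Python (the language is an assumption made only for illustration; the question does not name one):

```python
import re

# The pattern from this answer, unchanged: two capturing groups.
pattern = re.compile(r'\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)')

m = pattern.search('<a href="/blabla.pdf">omg</a>')

# groups() lists the capturing groups in order: the href value, then the '>'.
groups = m.groups()
print(groups)  # ('/blabla.pdf', '>')
```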


走好不送

#3 · 2020-04-21 08:59

You're conflating two distinct concepts: capturing and consuming. Regexes normally consume whatever they match; that's just how they work. Additionally, most regex flavors let you use capturing groups to pluck out specific parts of the overall match. (The overall match is often referred to as the zeroth capturing group, but that's just a figure of speech.)
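
To make the distinction concrete, here is a small sketch in Python (chosen only for illustration; the question does not say which language is in use). It runs the asker's streamlined pattern and compares the overall match with the single capturing group:

```python
import re

# The asker's streamlined pattern, unchanged: a non-capturing group for the
# start of the tag, then a capturing group around the final '>'.
pattern = re.compile(r'(?:\<a[^*]href="http://[^"]+"+[^>]*)(>)')

m = pattern.search('<a href="http://blabla">omg</a>')

whole_match = m.group(0)   # everything the regex consumed
captured = m.group(1)      # only what the capturing group matched
captured_span = m.span(1)  # where that '>' sits in the input

print(whole_match)    # <a href="http://blabla">
print(captured)       # >
print(captured_span)  # (23, 24)
```

The non-capturing group still contributes its text to the overall match; it just does not get a group number of its own.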

It sounds like you're trying to match a whole <A> tag, but only consume the final >. That's not possible in most regex flavors, JavaScript included. But if you're using Perl or PHP, you could use \K to spoof the match start position:

(?i)<a\s+[^>]*?href="http://[^"]+"[^>]*\K>

And in .NET you could use a lookbehind (which, like a lookahead, matches without consuming):

(?i)(?<=<a\s+[^>]*?href="http://[^"]+"[^>]*)>

Of the other flavors that support lookbehinds, most place restrictions on them that render them unusable for this task.
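
In a flavor with neither `\K` nor variable-length lookbehind, one common workaround (a different technique from the two above) is to capture everything before the `>` and write it back in the replacement. A sketch in Python, assuming the goal is to replace or decorate the closing bracket; the ` target="_blank"` attribute and the `tag_prefix` name are purely hypothetical:

```python
import re

# Hypothetical goal: add an attribute just before the tag's closing '>'.
# Group 1 captures everything up to (but not including) that '>'.
tag_prefix = re.compile(r'(<a\s+[^>]*?href="http://[^"]+"[^>]*)>', re.IGNORECASE)

html = '<a href="http://blabla">omg</a> <a href="blabla">omg</a>'

# \g<1> writes the captured prefix back unchanged, so in effect
# only the '>' itself is replaced.
result = tag_prefix.sub(r'\g<1> target="_blank">', html)
print(result)
```

The whole tag is still consumed by the match; putting group 1 back is what makes the substitution behave as if only the bracket had been matched.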


混吃等死

#4 · 2020-04-21 09:10

If I'm understanding your request correctly...

\<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)


Bombasti

#5 · 2020-04-21 09:16

Rewrite your regex as:

(?:\<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)(>)
   non capture __^^                                    ^ ^
                                             capture __|_|

As Tony Lukasavage said, there is an unnecessary non-capture group, and, moreover, there is no need to escape <, so it becomes:

  <a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)
non capture __^^                                    ^ ^
                                          capture __|_|
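
As a quick check, the final pattern can be run against the four samples from the question. This is only a sketch in Python (the question does not name a language); it records where the captured `>` sits in each sample, or None when the sample does not match:

```python
import re

# The final pattern from this answer, unchanged.
pattern = re.compile(r'<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)')

samples = [
    '<a href="http://blabla">omg</a>',
    '<a href="blabla">omg</a>',
    '<a href="http://blabla.pdf">omg</a>',
    '<a href="/blabla.pdf">omg</a>',
]

# For each sample, record the index of the captured '>' (group 1),
# or None if the pattern does not match at all.
positions = []
for s in samples:
    m = pattern.search(s)
    positions.append(m.start(1) if m else None)

print(positions)  # [23, None, 27, 21]
```

Group 1 now holds only the bracket, while the overall match still spans the whole opening tag.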


男人必须洒脱

#6 · 2020-04-21 09:16

Your parentheses are around the tag itself and the href's contents, so that's what will be captured. If you need to capture the closing >, then put the parentheses around it.
