regex, problem with backreference in pattern with

Posted 2019-02-26 04:26

I wonder what the problem is with the backreference here:

preg_match_all('/__\((\'|")([^\1]+)\1/', "__('match this') . 'not this'", $matches);

It is expected to match the string between __('') but it actually returns:

match this') . 'not this

Any ideas?

4 answers
欢心
Reply #2 · 2019-02-26 04:53

Make your regex ungreedy:

preg_match_all('/__\((\'|")([^\1]+)\1/U', "__('match this') . 'not this'", $matches);
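
As a rough check of what the ungreedy version captures, here is a minimal sketch using the sample string from the question. The variable name $subject is only for illustration; the pattern is the one shown above with the U modifier.

```php
<?php
// Sample input taken from the question.
$subject = "__('match this') . 'not this'";

// U (PCRE_UNGREEDY) makes + match as few characters as possible,
// so the second group stops at the first closing quote.
preg_match_all('/__\((\'|")([^\1]+)\1/U', $subject, $matches);

// $matches[0] is ["__('match this'"]  (the closing parenthesis is not in the pattern)
// $matches[2] is ['match this']
print_r($matches[2]);
```

This only works because the quantifier stops early; [^\1] still does not refer to the captured quote.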
Evening l夕情丶
Reply #3 · 2019-02-26 05:00

You can use something like: /__\(("[^"]+"|'[^']+')\)/
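
A minimal sketch of that pattern applied to the string from the question (the variable name $subject is illustrative). Note that the captured group includes the surrounding quotes.

```php
<?php
// Sample input taken from the question.
$subject = "__('match this') . 'not this'";

// One alternative per quote type; each forbids its own quote character inside.
preg_match_all('/__\(("[^"]+"|\'[^\']+\')\)/', $subject, $matches);

// $matches[0] is ["__('match this')"]
// $matches[1] is ["'match this'"]  (quotes included)
print_r($matches[1]);
```

If only the text between the quotes is wanted, the quotes can be trimmed afterwards, for example with trim($matches[1][0], "'\"").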

干净又极端
Reply #4 · 2019-02-26 05:04

I'm surprised it didn't give you an unbalanced parenthesis error message.

 /
   __
   (
       (\'|")
       ([^\1]+)
       \1
 /

This [^\1] will not take the contents of capture group 1 and put them into a character
class. Inside a character class, \1 is read as an octal escape, so [^\1] matches any
character except the one with character code 1.
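
A small check of how PCRE, as used by PHP's preg functions, treats \1 inside a character class. This is only a sketch; the variable names are made up for illustration.

```php
<?php
// Inside a character class, \1 is an octal escape (character code 1),
// not a backreference to capture group 1.
$quoteAllowed   = preg_match('/^[^\1]$/', "'");    // a quote is not excluded
$digitAllowed   = preg_match('/^[^\1]$/', "1");    // the digit 1 is not excluded either
$controlAllowed = preg_match('/^[^\1]$/', "\x01"); // only character code 1 is excluded

// $quoteAllowed is 1, $digitAllowed is 1, $controlAllowed is 0
var_dump($quoteAllowed, $digitAllowed, $controlAllowed);
```

So in the original pattern the class effectively matches almost anything, including both quote characters.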

Try this:

/__\(('|").*?\1\).*/

You can add an inner capturing group to capture just what's between the quotes:
/__\(('|")(.*?)\1\).*/

Edit: If no inner delimiter is allowed, use Qtax's regex.
Even though it is non-greedy, ('|").*?\1 will still match everything up to the trailing anchor, in this case all of __('all'this'will"match'), so it's better to use ('[^']*'|"[^"]*") as
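
To illustrate the difference on the inner-delimiter example, here is a minimal sketch comparing the lazy backreference pattern with the per-quote alternation. The variable names are hypothetical.

```php
<?php
// Input with quote characters inside the outer quotes.
$subject = "__('all'this'will\"match')";

// Lazy .*? keeps extending until the quote is followed by a closing parenthesis.
preg_match('/__\((\'|")(.*?)\1\).*/', $subject, $lazy);
// $lazy[2] is: all'this'will"match

// The alternation forbids the delimiter inside the string, so nothing matches here.
$strictCount = preg_match('/__\((\'[^\']*\'|"[^"]*")\)/', $subject);
// $strictCount is 0

var_dump($lazy[2], $strictCount);
```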

Animai°情兽
Reply #5 · 2019-02-26 05:13

You can't use a backreference inside a character class because a character class matches exactly one character, and a backreference can potentially match any number of characters, or none.

What you're trying to do requires a negative lookahead, not a negated character class:

preg_match_all('/__\(([\'"])(?:(?!\1).)+\1\)/',
    "__('match this') . 'not this'", $matches);

I also changed your alternation - \'|" - to a character class - [\'"] - because it's much more efficient, and I escaped the outer parentheses to make them match literal parentheses.
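
For reference, a minimal sketch of what the lookahead version returns on the sample string. The second pattern, with an extra capturing group around the lookahead loop, is an illustrative variation and not part of the answer above.

```php
<?php
// Sample input taken from the question.
$subject = "__('match this') . 'not this'";

// Pattern from the answer: (?!\1). consumes one character at a time,
// but only if that character is not the opening quote.
preg_match_all('/__\(([\'"])(?:(?!\1).)+\1\)/', $subject, $matches);
// $matches[0] is ["__('match this')"]
// $matches[1] is ["'"]

// Variation (an assumption for illustration): capture the quoted text itself.
preg_match_all('/__\(([\'"])((?:(?!\1).)+)\1\)/', $subject, $inner);
// $inner[2] is ['match this']

print_r($matches[0]);
print_r($inner[2]);
```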


EDIT: I guess I need to expand that "more efficient" remark. I took the example Friedl used to demonstrate this point and tested it in RegexBuddy.

Applied to target text abababdedfg,
^[a-g]+$ reports success after three steps, while
^(?:a|b|c|d|e|f|g)+$ takes 55 steps.

And that's for a successful match. When I try it on abababdedfz,
^[a-g]+$ reports failure after 21 steps;
^(?:a|b|c|d|e|f|g)+$ takes 99 steps.

In this particular case the impact on performance is so trivial it's not even worth mentioning. I'm just saying whenever you find yourself choosing between a character class and an alternation that both match the same things, you should almost always go with the character class. Just a rule of thumb.
