Do backreferences need to come after the group the

2019-02-23 17:42发布

While running some tests for this answer, I noticed the following unexpected behavior. This will remove all occurrences of <tag> after the first:

var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>

But this will not:

Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>

Similarly, this will remove all occurences of <tag> before the last:

Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>

But this will not:

Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>

So this got me thinking…

In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?

1条回答
淡お忘
2楼-- · 2019-02-23 17:52

Your question got me thinking too, so I ran a few tests with RegexBuddy and to my surprise the second regex (?<=\1.*)(<[^>]+>) which you said didn't work actually worked and the others worked exactly like you said. I then tried the same expression - the second one - in C# code but it didn't work like what happened with you.

This got me confused, then I noticed that my RegexBuddy version dates back to 2008 so there must have been some change in how the .NET engine works, but this shed the light on a fact I though is rational, it seems that before 2008 lookbehinds were evaluated after the rest of the expression matched. I felt this behavior is a bit acceptable with lookbehinds since you need to match something before you look behind to match something before it.

Nevertheless, the engines these days seem to evaluate lookarounds when it encounters them and I was able to find this out by using the following expression which is like the reverse situation of your case:

(?<=(\w))\1

As you can see I captured a word character inside the regex and referenced it outside it. I tested this on the string hello and it matched at the second l character as expected and this proves that the lookbehind was executed before attempting to match the rest of the expression.

Conclusion: Yes, a back reference need to appear after the group it references or it will have no match semantics.

查看更多
登录 后发表回答