UB: C#'s Regex.Match returns whole string inst

Posted 2020-04-11 06:09

Question:

Attention! This is NOT related to the question "Regex problem, matches the whole string instead of a part".


Hi all. I am trying to do:

Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");

Aside from the other VS Hebrew quirks (how do you like ] being replaced with [ when editing the string?), it occasionally returns crazy results:

Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).

Regex.Matches() acts the same way.

What could be a general reason for such behaviour? Note: it does not act this way on simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (the sample is displayed incorrectly; please look at the page's source code), so there must be something in the regex that makes it greedy. The matched string contains [ .... ], but simply adding them to the test string doesn't cause the same effect.

Answer 1:

I hit this problem when I first started using the .NET regex, too. The way to understand this is that the Groups member of Match is the nesting member: you have to traverse Groups to get down to the lower-level captures. Each Group also has a Captures member. The Match is kind of like the top "Group" in that it represents the successful match of your whole expression against the string. A single input string can have multiple matches. The Captures member of the Match represents the match of your full expression.

Whenever you have a single capturing group, as you do, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].

Remember that a single capturing group can capture multiple substrings in a single match operation. If that were the case, you would see Match.Groups[1].Captures.Count greater than 1. Also, I think if you passed multiple matching lines of text to a single Match call, you would see Match.Captures.Count greater than 1, but each top-level Match.Captures entry would be the full string matched by your full expression.
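To make the "one group, several captures" case concrete, here is a minimal sketch. The input "123", the pattern (\d)+ and the helper name GroupOneCaptureCount are made up for illustration and are not from the question. The group is under a quantifier, so it captures once per digit, while the Match itself still holds a single top-level capture in this run.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedGroupDemo
{
    // Hypothetical helper: how many substrings did group 1 capture?
    public static int GroupOneCaptureCount(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Groups[1].Captures.Count;
    }

    public static void Main()
    {
        // (\d)+ repeats the capturing group once per digit.
        Match m = Regex.Match("123", @"(\d)+");

        Console.WriteLine(m.Value);                    // 123 (the entire match)
        Console.WriteLine(m.Groups[1].Value);          // 3   (the last capture of group 1)
        Console.WriteLine(m.Groups[1].Captures.Count); // 3   (one capture per digit)
        Console.WriteLine(m.Captures.Count);           // 1   (the match itself)
    }
}
```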



Answer 2:

There is one capture group in the pattern; that is group 1.

There is always group 0, which is the entire match.

Therefore there are a total of 2 groups.
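As a rough illustration, the pattern from the question can be run against a plain ASCII line. The sample line and the helper name MatchSample are assumptions made for this sketch; the original Hebrew input is not shown in the question. Groups[0] comes out as everything the pattern consumed, and Groups[1] as only the parenthesized part.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupZeroDemo
{
    // Pattern copied from the question.
    private const string Pattern = @"^.{0,9} - \[(.*)?\s\d{1,3}";

    // Hypothetical helper: runs the question's pattern on a given line.
    public static Match MatchSample(string line)
    {
        return Regex.Match(line, Pattern);
    }

    public static void Main()
    {
        // Made-up ASCII stand-in for the Hebrew line.
        Match m = MatchSample("abc - [hello world 42] tail");

        Console.WriteLine(m.Groups.Count);    // 2 (group 0 plus the one capturing group)
        Console.WriteLine(m.Groups[0].Value); // abc - [hello world 42
        Console.WriteLine(m.Groups[1].Value); // hello world
        Console.WriteLine(m.Value == m.Groups[0].Value); // True
    }
}
```

If the input line happens to end right after the digits, Groups[0] covers the whole line, which would look like "the whole string" being returned.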



Answer 3:

My test regex was different from any others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.

Now, why we should call the Regex behaviour undocumented, not undefined:

Let's do some matches against "1.234567890".

  • PCRE-like syntax: (.)\.2345678
  • lookahead syntax: (.)(?=\.\d)

When you're doing a normal match, the result is copied from the whole matched part of the line, no matter where you've put the parentheses; when lookaheads are present, only what does not belong to them is copied.

So, the matches will return:

  • PCRE: 1.2345678 (at 2300, this looks like the original string, and I start yelling here at SO)
  • lookahead: 1
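The two results above can be reproduced with a short sketch. The class and helper names are made up for illustration. It also prints Groups[1], which holds the parenthesized part in both cases.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    private const string Input = "1.234567890";

    // Hypothetical helper: returns the full matched text for a pattern.
    public static string WholeMatch(string pattern)
    {
        return Regex.Match(Input, pattern).Value;
    }

    public static void Main()
    {
        // Plain pattern: everything the pattern consumed is part of the match.
        Match plain = Regex.Match(Input, @"(.)\.2345678");
        Console.WriteLine(plain.Value);           // 1.2345678
        Console.WriteLine(plain.Groups[1].Value); // 1

        // Lookahead: (?=...) checks the following text without consuming it.
        Match ahead = Regex.Match(Input, @"(.)(?=\.\d)");
        Console.WriteLine(ahead.Value);           // 1
        Console.WriteLine(ahead.Groups[1].Value); // 1
    }
}
```

In both cases Match.Value is the consumed text; the lookahead simply consumes less, so the parentheses alone never shrink the overall match.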