Attention! This is NOT a Regex problem. Regex.Match matches the whole string instead of a part
Hi all.
I'm trying to do:
Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");
Aside from the other VS Hebrew quirks (how do you like it replacing ] with [ when editing the string?), it occasionally returns crazy results:
Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).
Regex.Matches() behaves the same way.
What could be the general reason for such behaviour? Note: it doesn't act this way on simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (the sample is displayed incorrectly! Please look at the page's source code), so there must be something in the regex that makes it greedy. The matched string contains [ .... ], but simply adding them to the test string doesn't cause the same effect.
I hit this problem when I first started using the .NET regex, too. The way to understand this is that the Groups member of Match is the nesting member. You have to traverse Groups in order to get down to the lower captures. Groups also have Captures members. The Match is kind of like the top "Group" in that it represents the successful "match" of the whole string against your expression. The single input string can have multiple matches. The Captures member of the Match represents the match of your full expression.
Whenever you have a single capturing group, as you have, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].
Remember that a single capturing group can capture multiple substrings in a single match operation, for example when the group is quantified. If this were the case, you would see Match.Groups[1].Captures.Count be greater than 1. Also, if you pass text that matches more than once to a single Match call, you still get only the first match, and Match.Captures.Count stays at 1; use Regex.Matches() or Match.NextMatch() to reach the others. Each top-level Match.Captures entry is the full string matched by your full expression.
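To make the point about a group capturing several substrings concrete, here is a minimal sketch. The input "abc123", the pattern (\d)+ and the CaptureDemo helper are made up for illustration and are not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Hypothetical helper: number of substrings group 1 captured in one match.
    public static int GroupOneCaptureCount(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Captures.Count : 0;
    }

    public static void Main()
    {
        // The quantified group (\d)+ is entered three times inside one match.
        Match m = Regex.Match("abc123", @"(\d)+");

        Console.WriteLine(m.Value);                    // 123 (the whole match)
        Console.WriteLine(m.Captures.Count);           // 1   (the match itself)
        Console.WriteLine(m.Groups[1].Value);          // 3   (last capture of group 1)
        Console.WriteLine(m.Groups[1].Captures.Count); // 3   (all captures of group 1)
    }
}
```

Groups[1].Value only exposes the last capture; the earlier ones are reachable through Groups[1].Captures.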
There is one capture group in the pattern; that is group 1.
There is always group 0, which is the entire match.
Therefore there are a total of 2 groups.
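A short sketch of what that means for the pattern from the question. The sample line "12:34 - [name 42]" and the GroupDemo helper are assumptions chosen for illustration (plain ASCII instead of Hebrew, to avoid display problems).

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Pattern copied from the question.
    const string Pattern = @"^.{0,9} - \[(.*)?\s\d{1,3}";

    // Hypothetical helper: runs the question's pattern against one line.
    public static Match Run(string line) => Regex.Match(line, Pattern);

    public static void Main()
    {
        Match m = Run("12:34 - [name 42]");

        Console.WriteLine(m.Groups.Count);    // 2: group 0 plus one capturing group
        Console.WriteLine(m.Groups[0].Value); // 12:34 - [name 42 (everything the pattern consumed)
        Console.WriteLine(m.Value);           // same text as Groups[0]
        Console.WriteLine(m.Groups[1].Value); // name (only what (.*)? captured)
    }
}
```

If the pattern happens to consume the entire line, Groups[0] is the whole input, which is the expected behaviour rather than a sign of a problem.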
My test regex was different from all the others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.
Now, why we should call this Regex behaviour undocumented, not undefined: let's do some matches against "1.234567890".
- PCRE-like syntax: (.)\.2345678
- lookahead syntax: (.)(?=\.\d)
When you're doing a normal match, the result is copied from the whole matched part of the line, no matter where you've put the parentheses; when lookaheads are present, only the part that does not belong to them is copied.
So, the matches will return:
- PCRE: 1.2345678 (at 23:00, this looks like the original string, and I start yelling here on SO)
- lookahead: 1
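The two results can be reproduced with a small sketch like the following. The LookaheadDemo class and its helper names are made up for illustration; the input and both patterns are the ones listed above.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    const string Input = "1.234567890";

    // Hypothetical helpers: the whole match and the first capturing group.
    public static string WholeMatch(string pattern) => Regex.Match(Input, pattern).Value;
    public static string GroupOne(string pattern) => Regex.Match(Input, pattern).Groups[1].Value;

    public static void Main()
    {
        // Plain pattern: everything the pattern consumes is part of the match.
        Console.WriteLine(WholeMatch(@"(.)\.2345678")); // 1.2345678
        Console.WriteLine(GroupOne(@"(.)\.2345678"));   // 1

        // Lookahead: (?=...) checks the following text without consuming it.
        Console.WriteLine(WholeMatch(@"(.)(?=\.\d)"));  // 1
        Console.WriteLine(GroupOne(@"(.)(?=\.\d)"));    // 1
    }
}
```

In both cases Groups[1] holds only the parenthesized part, so reading Groups[1] instead of the match value gives the same answer with or without the lookahead.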