I have the following regex in a C# program, and have difficulties understanding it:
(?<=#)[^#]+(?=#)
I'll break it down to what I think I understood:
(?<=#) a group, matching a hash. what's `?<=`?
[^#]+ one or more non-hashes (used to achieve non-greediness)
(?=#) another group, matching a hash. what's the `?=`?
So the problem I have is the `?<=` and `?<` part. From reading MSDN, `?<name>` is used for naming groups, but in this case the angle bracket is never closed. I couldn't find `?=` in the docs, and searching for it is really difficult, because search engines mostly ignore those special characters.
They're called look-arounds: http://www.regular-expressions.info/lookaround.html
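As a rough illustration of what that means for the pattern in the question, here is a minimal sketch. It uses Python's `re` module purely for convenience (the lookaround syntax used here is the same in .NET), and the sample string `id=#abc#` is made up for the example. The point is that the hashes are checked but are not part of the match:

```python
import re

# Pattern from the question. The sample input below is hypothetical.
pattern = re.compile(r"(?<=#)[^#]+(?=#)")

m = pattern.search("id=#abc#")

matched_text = m.group(0)  # only the text between the hashes
matched_span = m.span()    # start/end offsets exclude both '#' characters

print(matched_text)  # abc
print(matched_span)  # (4, 7)
```

So `(?<=#)` and `(?=#)` are not capturing groups: they only assert that a `#` sits immediately before and after the match.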
As another poster mentioned, these are lookarounds: special constructs for changing what gets matched and when. This says: look behind for a `#` without including it in the match, match one or more non-`#` characters, then look ahead for a `#` without including it in the match.

So this will match all the characters in between two `#`s.

Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all `b`s not followed by an `a`." Your first attempt might be something like `b[^a]`, but that's not right: this will also match the `bu` in `bus` or the `bo` in `boy`, but you only wanted the `b`. And it won't match the `b` in `cab`, even though that's not followed by an `a`, because there are no more characters to match.

To do that correctly, you need a lookahead:
`b(?!a)`. This says "match a `b` but don't match an `a` afterwards, and don't make that part of the match". Thus it'll match just the `b` in `bolo`, which is what you want; likewise it'll match the `b` in `cab`.

They are called lookarounds; they allow you to assert whether a pattern matches or not, without actually making the match. There are 4 basic lookarounds:

- Positive lookarounds: assert that `pattern` can be matched...
  - `(?=pattern)` - ... to the right of current position (look ahead)
  - `(?<=pattern)` - ... to the left of current position (look behind)
- Negative lookarounds: assert that `pattern` cannot be matched...
  - `(?!pattern)` - ... to the right
  - `(?<!pattern)` - ... to the left

As an easy reminder, for a lookaround:

- `=` is positive, `!` is negative
- `<` is look behind, otherwise it's look ahead
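
To see the four variants side by side, here is a small sketch. It is written with Python's `re` module only for illustration (these constructs use the same syntax in .NET); the helper `spans` and the sample text `bus bat cab` are made up for this example. It also includes the naive `b[^a]` attempt for comparison:

```python
import re

def spans(pattern, text):
    """Return (start_offset, matched_text) for every match of pattern in text."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]

# Hypothetical sample: the b's sit at offsets 0 (bus), 4 (bat) and 10 (cab).
text = "bus bat cab"

naive      = spans(r"b[^a]", text)    # consumes the character after the b
neg_ahead  = spans(r"b(?!a)", text)   # b NOT followed by a
pos_ahead  = spans(r"b(?=a)", text)   # b followed by a
pos_behind = spans(r"(?<=a)b", text)  # b preceded by a
neg_behind = spans(r"(?<!a)b", text)  # b NOT preceded by a

print(naive)       # [(0, 'bu')]
print(neg_ahead)   # [(0, 'b'), (10, 'b')]
print(pos_ahead)   # [(4, 'b')]
print(pos_behind)  # [(10, 'b')]
print(neg_behind)  # [(0, 'b'), (4, 'b')]
```

Note how `b[^a]` drags the `u` into the match and misses the final `b` entirely, while every lookaround version matches exactly one character.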
But why use lookarounds?
One might argue that lookarounds in the pattern above aren't necessary, and `#([^#]+)#` will do the job just fine (extracting the string captured by `\1` to get the non-`#`).

Not quite. The difference is that since a lookaround doesn't match the `#`, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.

Consider the following input string:

Now, `#([a-z]+)#` will give the following matches (as seen on rubular.com):

Compare this with `(?<=#)[a-z]+(?=#)`, which matches:

Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with `#([a-z]+)(?=#)`, which matches (as seen on rubular.com):
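
The three patterns can also be compared locally. The sketch below uses Python's `re` module for illustration (the syntax of these patterns is the same in .NET). The input string is an assumption chosen for this example, with two words sharing one `#` delimiter at the end; it is not necessarily the input used above.

```python
import re

# Assumed sample input: "three" and "four" share the '#' between them.
text = "and #one# and #two# and #three#four#"

# Both hashes are consumed, so the '#' after "three" is gone
# by the time the engine reaches "four".
consuming = re.findall(r"#([a-z]+)#", text)

# Neither hash is consumed, so the shared '#' serves both words.
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", text)

# Only the leading hash is consumed; the trailing one is merely checked.
ahead_only = re.findall(r"#([a-z]+)(?=#)", text)

print(consuming)   # ['one', 'two', 'three']
print(lookaround)  # ['one', 'two', 'three', 'four']
print(ahead_only)  # ['one', 'two', 'three', 'four']
```

The missing `four` in the first result is the practical difference: a consumed delimiter cannot take part in the next match, while an asserted one can.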