I'm trying to match the local part of an email address before the @ character with:
LOCAL_RE_NOTQUOTED = """
((
\w # alphanumeric and _
| [!#$%&'*+-/=?^_`{|}~] # special chars, but no dot at beginning
)
(
\w # alphanumeric and _
| [!#$%&'*+-/=?^_`{|}~] # special characters
| ([.](?![.])) # negative lookahead to avoid pairs of dots.
)*)
(?<!\.)(?:@) # no end with dot before @
"""
Testing with:
re.match(LOCAL_RE_NOTQUOTED, "a.a..a@", re.VERBOSE).group()
gives:
'a.a..a@'
Why is the @ printed in the output, even though I'm using a non-capturing group (?:@)?
Testing with:
re.match(LOCAL_RE_NOTQUOTED, "a.a..a@", re.VERBOSE).groups()
gives:
('a.a..a', 'a', 'a', None)
Why does the regex not reject the string with a pair of dots '..'?
You're confusing non-capturing groups (?:...) and lookahead assertions (?=...).
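The former do participate in the match (and are thus part of match.group(), which contains the overall match); they just don't create a numbered group that you can refer to later (\1 in the pattern, or match.group(1) in Python). A lookahead, in contrast, only checks what follows without consuming it, so (?=@) keeps the @ out of the match.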
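A minimal sketch of the difference, using a simplified pattern and a made-up address (both are illustrations only, not taken from the question):

```python
import re

s = "abc@example.com"  # hypothetical test string

# Non-capturing group: the @ is consumed, so it is part of the overall match,
# but it does not get a group number of its own.
m1 = re.match(r"(\w+)(?:@)", s)
print(m1.group())   # abc@
print(m1.groups())  # ('abc',)

# Lookahead: the @ must follow, but it is not consumed.
m2 = re.match(r"(\w+)(?=@)", s)
print(m2.group())   # abc
print(m2.groups())  # ('abc',)
```

In both cases groups() has a single entry; only the overall match differs.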
The second problem (why is the double dot matched?) is a bit trickier. It is caused by an error in your regex. When you wrote (shortened to make the point)
[+-/]
you wrote "match any character between + and /", and in ASCII the dot lies right between them (ASCII 43-47: +,-./). Therefore the character class matches the dot before the alternative with the lookahead assertion is ever tried. You need to place the dash at the end of the character class so that it is treated as a literal dash:
((
\w # alphanumeric and _
| [!#$%&'*+/=?^_`{|}~-] # special chars, but no dot at beginning
)
(
\w # alphanumeric and _
| [!#$%&'*+/=?^_`{|}~-] # special characters
| ([.](?![.])) # negative lookahead to avoid pairs of dots.
)*)
(?<!\.)(?=@) # no end with dot before @
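To check the range behaviour yourself, something like the following should do. The name FIXED_RE and the test strings are my own choices for illustration; the pattern is the corrected one shown above.

```python
import re

# [+-/] is a range from "+" (43) to "/" (47), so it also covers "," "-" and "."
print([chr(c) for c in range(ord("+"), ord("/") + 1)])  # ['+', ',', '-', '.', '/']
print(bool(re.fullmatch(r"[+-/]", ".")))  # True: the dot is inside the range
print(bool(re.fullmatch(r"[+/-]", ".")))  # False: trailing dash is a literal

# The corrected pattern (FIXED_RE is a hypothetical name).
FIXED_RE = r"""
((
\w                          # alphanumeric and _
| [!#$%&'*+/=?^_`{|}~-]     # special chars, but no dot at beginning
)
(
\w                          # alphanumeric and _
| [!#$%&'*+/=?^_`{|}~-]     # special characters
| ([.](?![.]))              # negative lookahead to avoid pairs of dots
)*)
(?<!\.)(?=@)                # no dot directly before @; @ is not consumed
"""

ok = re.match(FIXED_RE, "a.a.a@", re.VERBOSE)
bad = re.match(FIXED_RE, "a.a..a@", re.VERBOSE)
print(ok.group())  # a.a.a
print(bad)         # None
```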
And of course, if you want to use this logic, you can streamline it a bit:
^(?!\.) # no dot at the beginning
(?:
[\w!#$%&'*+/=?^_`{|}~-] # alnums or special characters except dot
| (\.(?![.@])) # or dot unless it's before a dot or @
)*
(?=@) # end before @
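A quick way to try the streamlined version, as a sketch: STREAMLINED_RE and local_part are hypothetical names introduced here, and the test strings are made up.

```python
import re

# The streamlined pattern from above (STREAMLINED_RE is a hypothetical name).
STREAMLINED_RE = r"""
^(?!\.)                      # no dot at the beginning
(?:
[\w!#$%&'*+/=?^_`{|}~-]      # alnums or special characters except dot
| (\.(?![.@]))               # or dot unless it's before a dot or @
)*
(?=@)                        # end before @
"""

def local_part(s):
    """Return the matched local part, or None if the string is rejected."""
    m = re.match(STREAMLINED_RE, s, re.VERBOSE)
    return m.group() if m else None

print(local_part("a.a.a@"))   # a.a.a
print(local_part("a.a..a@"))  # None (pair of dots)
print(local_part(".a@"))      # None (leading dot)
print(local_part("a.@"))      # None (dot right before @)
```

Note that, because of the * quantifier, this pattern also accepts an empty local part (local_part("@") returns an empty string). If that is not what you want, changing * to + should rule it out.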