Is there a simple way to match all characters in a class except a certain set of them? For example, in a language where I can use \w to match the set of all Unicode word characters, is there a way to exclude just one character, like an underscore "_", from that match?
The only idea that came to mind was to use a negative lookahead/lookbehind around each character, but that seems more complex than necessary when I effectively just want to match a character against a positive match AND a negative match. For example, if & were an AND operator, I could do this...
^(\w&[^_])+$
Try using subtraction:
Note: This will work in Java, but might not in some other Regex engine.
You can use the negation of the \w class (that is, \W) and exclude it together with the underscore:
[^\W_]
It really depends on your regex flavor.
.NET
... provides only one simple character class set operation: subtraction. This is enough for your example, so you can simply use
[\w-[_]]
If a - is followed by a nested character class, it's subtracted. Simple as that...
Java
... provides a much richer set of character class set operations. In particular you can get the intersection of two sets like [[abc]&&[cde]] (which would give c in this case). Intersection and negation together give you subtraction:
[\w&&[^_]]
Perl
... supports set operations on extended character classes as an experimental feature (available since Perl 5.18). In particular, you can directly subtract arbitrary character classes:
(?[ \w - [_] ])
All other flavors
... (that support lookaheads) allow you to mimic the subtraction by using a negative lookahead:
(?!_)\w
This first checks that the next character is not a _ and then matches any \w (which can't be _ due to the negative lookahead).
Note that each of these approaches is completely general in that you can subtract two arbitrarily complex character classes.
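To illustrate that the lookahead form generalizes beyond the underscore case, here is a minimal sketch using Python's standard re module. The names CONSONANT and WORD_NO_UNDERSCORE are made up for this example; the first pattern subtracts the vowels from [a-z], the second is the underscore case from the question.

```python
import re

# Hypothetical names, chosen only for this illustration.
# (?![aeiou]) rejects a vowel at the current position without consuming it;
# [a-z] then consumes the single character that passed the check.
CONSONANT = re.compile(r"(?![aeiou])[a-z]")

# Same idea for the question's case: any word character except "_".
WORD_NO_UNDERSCORE = re.compile(r"(?!_)\w")

print(CONSONANT.findall("regex flavor"))
print(WORD_NO_UNDERSCORE.findall("a_b_1"))
```

The lookahead costs one extra check per character, so where a flavor offers real set subtraction or intersection, that is usually the tidier choice.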
This can be done in Python with the regex module. Something like:
[\w--[_]]
You'd typically install the regex module with pip:
pip install regex
EDIT:
The regex module has two behaviours, version 0 and version 1. Set subtraction (as above) is a version 1 behaviour. The PyPI docs claim version 1 is the default behaviour, but you may find this is not the case. You can check with
To set it to version 1:
or to use version one in a single expression:
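As a rough sketch of the version check and the per-expression flag described above: this assumes the third-party regex package exposes DEFAULT_VERSION, VERSION0 and VERSION1 as its documentation describes. The helper name runs_without_underscore is invented for this example, and the standard-library fallback is only there so the snippet still runs when regex is not installed.

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # the function below falls back to the standard library

def runs_without_underscore(text):
    """Return runs of word characters, treating "_" as a separator."""
    if regex is not None:
        # (?V1) switches on version 1 behaviour for this pattern only,
        # so "--" is read as set subtraction: \w minus "_".
        return regex.findall(r"(?V1)[\w--[_]]+", text)
    # Fallback with plain re: the negated class [^\W_] should match
    # the same characters without needing set operations.
    return re.findall(r"[^\W_]+", text)

if regex is not None:
    # True means version 1 is already the default in this installation.
    # To change the default globally: regex.DEFAULT_VERSION = regex.VERSION1
    print(regex.DEFAULT_VERSION == regex.VERSION1)

print(runs_without_underscore("foo_bar baz9"))
```

Putting (?V1) in the pattern itself avoids changing module-wide state, which matters if other code in the same process relies on version 0 behaviour.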
A negative lookahead is the correct way to go insofar as I understand your question:
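A minimal sketch of that idea applied to the whole-string match from the question, using Python's standard re module. The names PATTERN and is_word_without_underscore are hypothetical, and fullmatch is used here in place of the ^ and $ anchors.

```python
import re

# One repetition = "the next character is not '_'" followed by one \w.
PATTERN = re.compile(r"(?:(?!_)\w)+")

def is_word_without_underscore(text: str) -> bool:
    """True if text is non-empty and made only of word characters other than "_"."""
    # fullmatch requires the pattern to cover the entire string,
    # much like wrapping it in ^...$.
    return PATTERN.fullmatch(text) is not None

print(is_word_without_underscore("héllo123"))
print(is_word_without_underscore("snake_case"))
```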