Is there a simple way to match all characters in a class except a certain set of them? For example, in a language where \w matches the set of all Unicode word characters, is there a way to exclude just one character, such as an underscore "_", from that match?
The only idea that came to mind was to use a negative lookahead/lookbehind around each character, but that seems more complex than necessary when I effectively just want to match a character against a positive match AND a negative match. For example, if & were an AND operator, I could do this...
^(\w&[^_])+$
It really depends on your regex flavor.
.NET
... provides only one simple character class set operation: subtraction. This is enough for your example, so you can simply use
[\w-[_]]
If a - is followed by a nested character class, it's subtracted. Simple as that...
Java
... provides a much richer set of character class set operations. In particular, you can get the intersection of two sets like [[abc]&&[cde]] (which would give c in this case). Intersection and negation together give you subtraction:
[\w&&[^_]]
Perl
... supports set operations on extended character classes as an experimental feature (available since Perl 5.18). In particular, you can directly subtract arbitrary character classes:
(?[ \w - [_] ])
All other flavors
... (that support lookaheads) allow you to mimic the subtraction by using a negative lookahead:
(?!_)\w
This first checks that the next character is not a _ and then matches any \w (which can't be _ due to the negative lookahead).
Note that each of these approaches is completely general in that you can subtract one arbitrarily complex character class from another.
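To make the lookahead variant concrete, here is a minimal sketch in Python, whose standard-library re module has no character class set operations but does support lookaheads. The helper name is_word_without_underscore is made up for illustration, and the pattern is wrapped in a group with + so it can check a whole string:

```python
import re

# (?!_) rejects an underscore at the current position,
# then \w consumes exactly one word character.
WORD_NO_UNDERSCORE = re.compile(r'(?:(?!_)\w)+')

def is_word_without_underscore(text):
    """Hypothetical helper: True if text is one or more word characters, none of them '_'."""
    return WORD_NO_UNDERSCORE.fullmatch(text) is not None

print(is_word_without_underscore("héllo42"))     # True
print(is_word_without_underscore("snake_case"))  # False
```

The lookahead is evaluated once per character, so this is likely a little slower than a native set operation, but it should port to most flavors that support lookaheads.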
You can take the negation of the \w class (that is, \W), add the underscore to it, and negate the whole set:
^([^\W_]+)$
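This works because [^\W_] reads as "not (a non-word character or an underscore)", which is the same as "a word character that is not an underscore". A small sketch using Python's standard re module, with sample strings chosen only for illustration:

```python
import re

# [^\W_] == "not (non-word or underscore)" == "word character other than underscore"
whole_string = re.compile(r'^([^\W_]+)$')

print(whole_string.match("abc123") is not None)   # True
print(whole_string.match("abc_123") is not None)  # False

# Without the anchors, the same class splits text on underscores and non-word characters.
print(re.findall(r'[^\W_]+', "foo_bar baz"))      # ['foo', 'bar', 'baz']
```

Since this relies only on negated character classes, it should work in essentially any regex flavor, with no lookaheads or set operations required.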
A negative lookahead is the correct way to go insofar as I understand your question:
^((?!_)\w)+$
Try subtracting by intersecting with a negated class:
[\w&&[^_]]+
Note: This will work in Java, but might not in other regex engines.
This can be done in Python with the third-party regex module. Something like:
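import regex as re  # third-party module, not the standard-library re
# Match runs of non-word characters and underscores, except spaces (set difference needs version 1)
pattern = re.compile(r'[\W_--[ ]]+')
cleanString = pattern.sub('', rawString)  # rawString is your input string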
You'd typically install the regex module with pip:
pip install regex
EDIT:
The regex module has two behaviours, version 0 and version 1. Set subtraction (as above) is a version 1 behaviour. The PyPI docs claim version 1 is the default behaviour, but you may find this is not the case. You can check with
import regex
if regex.DEFAULT_VERSION == regex.VERSION1:
    print("version 1")
To set it to version 1:
regex.DEFAULT_VERSION = regex.VERSION1
or, to use version 1 in a single expression:
pattern = re.compile(r'(?V1)[\W_--[ ]]+')