I have the keyword "cum", which our firewall uses to block adult sites. The problem is that it works a little too well, because it also blocks any URL containing the word "document".
The firewall will take regex strings, and I tried this:
^.*(?!document)cum.*$
But it still matches "document". I have a feeling I should be using a pipe |, but I don't get it.
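To see why the attempt still matches: a lookahead is tested only at the position where it appears. In the pattern above that position is immediately before cum, where the text ahead begins with "cument", not "document", so the negative lookahead always succeeds. A quick check with Python's re module (chosen here only for illustration; the firewall's regex engine may behave differently, and is_blocked is a hypothetical helper):

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r"^.*(?!document)cum.*$")

def is_blocked(url: str) -> bool:
    """Return True if the question's pattern matches the URL."""
    return attempt.search(url) is not None

# The lookahead runs right before "cum", where the text ahead is
# "cuments.com", so "(?!document)" never sees the word "document".
for url in ("examplemydocuments.com", "funnycumsiteexample.com", "example.com"):
    print(url, is_blocked(url))
```

The first two URLs are both reported as blocked, which is the behaviour described in the question.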
What I want is to match anywhere *cum* is found in the URL (or domain name), but NOT if the word is document or documents.
Possible? As I understand it, a word boundary doesn't work here because the word cum won't necessarily be separated by whitespace when it's in a URL, and definitely not if it's in a domain name.
Here's another way to put it:
Allow "examplesearchdocuments.com"
Allow "examplemydocuments.com"
Allow "documentexample.com"
Allow "example.com/somedocuments"
Don't allow "funnycumsiteexample.com"
Don't allow "cumallovereverythingexample.com"
Don't allow "exampleseemycum.com"
where cum is the bad word to match. Sorry if any of these examples are real sites, I don't know how else to convey this.
My first suggestion would also be to use
\bcum\b
like the others, but that doesn't match e.g. cumming. You're almost right with the negative lookaround
(?!)
syntax, with a variant of the pattern to support the plural. You can check it at http://fiddle.re/3pyj by clicking Java for the examples you provided.
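As a rough sketch of the lookaround idea (not necessarily the exact pattern this answer proposed), cum can be accepted when it is either not preceded by "do" or not followed by "ent". Only the cum inside "document" fails both tests, and "documents" is covered as well because it also continues with "ent". The pattern and the is_blocked helper are assumptions, written for Python's re module; the firewall's engine would need to support lookbehind.

```python
import re

# Match "cum" unless it is BOTH preceded by "do" AND followed by "ent".
# Each branch of the alternation is one way of not being "document".
bad_word = re.compile(r"(?<!do)cum|cum(?!ent)")

def is_blocked(url: str) -> bool:
    """Return True if the URL contains "cum" outside of "document"."""
    return bad_word.search(url) is not None

# Examples taken from the question.
allowed = ["examplesearchdocuments.com", "examplemydocuments.com",
           "documentexample.com", "example.com/somedocuments"]
blocked = ["funnycumsiteexample.com", "cumallovereverythingexample.com",
           "exampleseemycum.com"]

for url in allowed + blocked:
    print(url, "blocked" if is_blocked(url) else "allowed")
```

This is also one place where the pipe mentioned in the question fits naturally: each side of | handles one of the two ways the match can differ from "document".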
My suggestion would be
^.*\bcum.*$
to match a word boundary, i.e. the start of a word, then 'cum' and anything after. Per the comments, I was wrong.
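The reason it fails can be checked directly: \b only matches between a word character and a non-word character (or the start or end of the string), and inside a domain name the letters on both sides of cum are usually word characters. A short check in Python, used here purely as an illustration with the examples from the question:

```python
import re

# The pattern suggested in this answer.
boundary = re.compile(r"^.*\bcum.*$")

def matches(url: str) -> bool:
    """Return True if the word-boundary pattern matches the URL."""
    return boundary.search(url) is not None

# "y" and "c" are both word characters, so there is no \b before "cum".
print(matches("funnycumsiteexample.com"))
print(matches("exampleseemycum.com"))
# Here "cum" sits at the start of the string, which counts as a boundary.
print(matches("cumallovereverythingexample.com"))
```

Only the last URL matches, so two of the three sites that should be blocked slip through.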
If you use a lookbehind inside your lookahead, you can match "cum" only if it is not within the word "document".
Here is some reading on lookaround http://www.regular-expressions.info/lookaround.html
Here it is against a large number of tests.
http://www.rubular.com/r/b5iZrn6Cjz
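A sketch of what a lookbehind nested inside a lookahead can look like, run against the examples from the question. The pattern below is an assumption and may differ from the one behind the Rubular link; it is written for Python's re module, and the mismatches helper is hypothetical.

```python
import re

# Before consuming "cum", assert that we are NOT at a spot that is preceded
# by "do" and followed by "cument", i.e. not in the middle of "document".
pattern = re.compile(r"(?!(?<=do)cument)cum")

# Expected verdicts taken from the question: True means "block".
expected = {
    "examplesearchdocuments.com": False,
    "examplemydocuments.com": False,
    "documentexample.com": False,
    "example.com/somedocuments": False,
    "funnycumsiteexample.com": True,
    "cumallovereverythingexample.com": True,
    "exampleseemycum.com": True,
}

def mismatches(cases: dict) -> list:
    """Return the URLs whose verdict differs from the expected one."""
    return [url for url, block in cases.items()
            if (pattern.search(url) is not None) != block]

# An empty list means every example behaves as the question asks.
print(mismatches(expected))
```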