RegEx to exclude match if a certain word is present

Posted 2019-07-14 05:08

Our firewall uses the keyword "cum" to block adult sites. The problem is that this works a little too well: it also blocks any URL containing the word "document".

The firewall will take regex strings, and I tried this:

^.*(?!document)cum.*$

But it still matches "document". I have a feeling I should be using a pipe (|), but I don't see how.

What I want is to match anywhere

*cum*

is found in the URL (or domain-name), but NOT if the word is document or documents.

Possible? As I understand it, a word boundary doesn't work here because the word cum won't necessarily be separated by white-space when it's in a URL, and definitely not if it's in a domain-name.

Here's another way to put it:

Allow "examplesearchdocuments.com"
Allow "examplemydocuments.com"
Allow "documentexample.com"
Allow "example.com/somedocuments"
Don't allow "funnycumsiteexample.com"
Don't allow "cumallovereverythingexample.com"
Don't allow "exampleseemycum.com"

where cum is the bad word to match. Sorry if any of these examples are real sites, I don't know how else to convey this.

2 answers
姐就是有狂的资本
Reply #2 · 2019-07-14 05:55

Like the others, my first suggestion would be \bcum\b, but that doesn't match e.g. cumming.

You're almost right with the negative lookaround (?!) syntax:

^.*(?<!do)cum(?!ent).*$

or

^.*(?<!do)cum(?!ents?).*$

to spell out the plural explicitly, although (?!ent) already rejects "ents" as well, so the two behave the same. You can check it at http://fiddle.re/3pyj by clicking Java to run the examples you provided.
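If you'd rather check this locally, here is a minimal sketch in Python. It assumes the firewall's regex engine treats lookarounds the same way Python's `re` module does, which may not hold for every product. The helper name `is_blocked` is made up for illustration; the URLs are the ones from the question. It also shows why the original attempt fails: the lookahead is tested right before "cum", where the upcoming text is "cument...", never "document", so it always succeeds.

```python
import re

# The asker's attempt and the lookaround version from this answer.
ATTEMPT = re.compile(r"^.*(?!document)cum.*$")
FIXED = re.compile(r"^.*(?<!do)cum(?!ent).*$")

# Sample URLs copied from the question.
ALLOW = [
    "examplesearchdocuments.com",
    "examplemydocuments.com",
    "documentexample.com",
    "example.com/somedocuments",
]
BLOCK = [
    "funnycumsiteexample.com",
    "cumallovereverythingexample.com",
    "exampleseemycum.com",
]


def is_blocked(pattern, url):
    """Hypothetical helper: True if the pattern matches the URL."""
    return pattern.search(url) is not None


for url in ALLOW + BLOCK:
    attempt = is_blocked(ATTEMPT, url)
    fixed = is_blocked(FIXED, url)
    print(f"{url:35} attempt={attempt} fixed={fixed}")
```

With Python's engine, the attempt blocks all seven URLs, while the lookaround version blocks only the last three.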

My suggestion would be ^.*\bcum.*$, which matches a word boundary (i.e. a word start) followed by 'cum' and anything after it.
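One caveat worth checking before relying on the word-boundary version: in a domain name the word is usually glued to its neighbours, so \b only fires when "cum" starts the string or follows a non-word character. A rough sketch, again assuming Python's `re` semantics rather than whatever engine the firewall uses:

```python
import re

BOUNDARY = re.compile(r"^.*\bcum.*$")

# URLs copied from the question.
samples = [
    "cumallovereverythingexample.com",  # "cum" starts the string
    "funnycumsiteexample.com",          # "cum" embedded after a letter
    "exampleseemycum.com",              # "cum" embedded after a letter
    "example.com/somedocuments",        # should stay allowed
]

# Map each URL to whether the word-boundary pattern matches it.
results = {url: BOUNDARY.search(url) is not None for url in samples}

for url, matched in results.items():
    print(f"{url:35} matched={matched}")
```

Under these assumptions only the first URL is matched, so the two embedded cases from the question would slip through.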

We Are One
Reply #3 · 2019-07-14 06:03

Per the comments, I was wrong.

If you use a lookbehind inside your lookahead, you can match "cum" only if it is not within the word "document".

cum(?!(?<=docum)ent)

Here is some reading on lookaround: http://www.regular-expressions.info/lookaround.html

Here it is against a large number of tests.

http://www.rubular.com/r/b5iZrn6Cjz
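For anyone who wants to reproduce this outside Rubular, a small sketch in Python follows. It assumes the engine allows a fixed-width lookbehind inside a lookahead, which Python's `re` does. It also compares against the split pattern (?<!do)cum(?!ent) from the other answer. The two `.example` strings are hypothetical test cases I made up, not from the question; they show where the two patterns diverge.

```python
import re

# Pattern from this answer: reject "cum" only when it sits inside "document".
NESTED = re.compile(r"cum(?!(?<=docum)ent)")
# Pattern from the other answer: reject "do" before OR "ent" after.
SPLIT = re.compile(r"(?<!do)cum(?!ent)")

allow = [
    "examplesearchdocuments.com",
    "examplemydocuments.com",
    "documentexample.com",
    "example.com/somedocuments",
]
block = [
    "funnycumsiteexample.com",
    "cumallovereverythingexample.com",
    "exampleseemycum.com",
]
# Hypothetical strings (not from the thread) where only one half of
# "document" surrounds the keyword.
edge_cases = ["pseudocum.example", "cumentry.example"]

for url in allow + block + edge_cases:
    nested = NESTED.search(url) is not None
    split = SPLIT.search(url) is not None
    print(f"{url:35} nested={nested} split={split}")
```

Both patterns agree on the seven URLs from the question. On the two edge cases the nested version still matches, while the split version lets them through because it treats "do" before or "ent" after as enough to skip the match.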
