I'm trying to use a negative lookahead in an Elasticsearch query.
The regex is:
(?!.*charge)(?!.*encode)(?!.*relate).*night.*
The text that I'm matching against is:
credited back on night stay, still having issues with construction.
causing health issues due to a chemical being sprayed and causes eyes
to irritated.
I haven't had any luck. Can someone give me a hand?
ES query:
"query": {
  "filtered": {
    "query": {
      "bool": {
        "must_not": [
          {
            "regexp": {
              "message": {
                "value": "(?!.*charge)(?!.*encode)(?!.*relate).*night.*",
                "flags_value": 65535
              }
            }
          }
        ]
      }
    },
    "filter": {
      "match": {
        "resNb": {
          "query": "462031152161",
          "type": "boolean"
        }
      }
    }
  }
}
Solution
You can solve the issue with either of the two:
"value": "~(charge|encode|relate)night~(charge|encode|relate)",
or
.*night.*&~(.*(charge|encode|relate).*)
Optionally, you can also set the flag explicitly (it is ON by default):
"flags" : "ALL"
How does it work?
In common NFA regular expressions, you usually have negative lookarounds that help restrict a more generic pattern (those that look like (?!...) or (?<!...)). However, in ElasticSearch, you need to use specific optional operators.
The ~ (tilde) is the complement operator that is used to negate the atom right after it. An atom is either a single symbol or a set of subpatterns/alternatives inside a group.
NOTE that all ES patterns are anchored at the start and end of the string by default, so you never need the ^ and $ anchors common in Perl-like, .NET, and other NFA flavors.
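The implicit anchoring can be imitated outside of ES for illustration. The sketch below uses Python's re module, which is a different (backtracking) engine from the one ES uses, so it only demonstrates the anchoring idea: re.fullmatch behaves as if ^ and $ were always present. The sample text is a shortened piece of the string from the question.

```python
import re

text = "credited back on night stay"

# Unanchored search finds "night" anywhere in the string.
found_anywhere = re.search(r"night", text) is not None

# fullmatch requires the pattern to cover the whole string,
# much like an implicitly anchored ES pattern.
bare_full = re.fullmatch(r"night", text) is not None
padded_full = re.fullmatch(r".*night.*", text) is not None

print(found_anywhere, bare_full, padded_full)  # True False True
```

This is why the ES patterns here carry an explicit .* (or a complement) on both sides of night.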
Thus,
~(charge|encode|relate) - matches any text from the start of the string other than charge, encode and relate
night - matches the word night
~(charge|encode|relate) - matches any text other than any of the 3 substrings, up to the end of the string.
In an NFA regex like Perl, you could write that pattern using a tempered greedy token:
/^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$/
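To see the tempered greedy token in action, here is a minimal sketch using Python's re module (any Perl-like engine should behave the same way for this pattern). The helper name matches_tempered and the second sample string are made up for illustration; the first sample is the opening sentence of the text from the question.

```python
import re

# Each "." is only consumed if none of the forbidden words starts at that position.
TEMPERED = re.compile(
    r"^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$"
)

def matches_tempered(s):
    """Return True if s contains 'night' and none of the forbidden words."""
    return TEMPERED.search(s) is not None

ok_text = "credited back on night stay, still having issues with construction."
bad_text = "extra charge for the night stay"  # hypothetical counter-example

print(matches_tempered(ok_text))   # True
print(matches_tempered(bad_text))  # False
```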
The second pattern is trickier: common NFA regexes usually do not jump from location to location when matching, thus, lookaheads anchored at the start of text are commonly used. Here, using an INTERSECTION we can just use 2 patterns, where one will be matching the string and the second one should also match the string.
.*night.* - match the whole line (as . matches any symbol but a newline; otherwise, use (.|\n)*) with night in it
& - and
~(.*(charge|encode|relate).*) - the line that does not have charge, encode or relate substrings in it.
An NFA Perl-like regex would look like
/^(?!.*(charge|encode|relate)).*night.*$/
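The two ways of thinking about this (one lookahead-based pattern versus an intersection of two whole-string patterns) can be compared in a small Python sketch. Python's re has no & or ~ operators, so the intersection is emulated here with two fullmatch checks; the function names and the extra sample strings are assumptions made for illustration only.

```python
import re

FORBIDDEN = r"(charge|encode|relate)"

# Single-pattern form: a negative lookahead anchored at the start of the string.
LOOKAHEAD = re.compile(r"^(?!.*" + FORBIDDEN + r").*night.*$")

def matches_lookahead(s):
    return LOOKAHEAD.search(s) is not None

def matches_intersection(s):
    # Emulates A&~B: s must fully match A and must NOT fully match B.
    has_night = re.fullmatch(r".*night.*", s) is not None
    has_forbidden = re.fullmatch(r".*" + FORBIDDEN + r".*", s) is not None
    return has_night and not has_forbidden

samples = [
    "credited back on night stay, still having issues with construction.",
    "night stay, issue related to a charge",  # hypothetical
    "no keyword here",                        # hypothetical
]
for s in samples:
    print(matches_lookahead(s), matches_intersection(s), s)
```

Both functions should agree on every single-line input, which is the point of the comparison: the intersection form splits one constrained pattern into two independent whole-string checks.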
You didn't use an anchor for your lookaheads. Try using "^" at the beginning of the pattern and it should work.