negative lookahead regex on elasticsearch

Posted 2019-02-18 06:12

Question:

I'm trying to use negative lookaheads in an Elasticsearch query. The regex is:

(?!.*charge)(?!.*encode)(?!.*relate).*night.*

the text that I'm matching against is:

credited back on night stay, still having issues with construction. causing health issues due to a chemical being sprayed and causes eyes to irritated.

I didn't have any luck. Can someone give me a hand?

ES query:

  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must_not": [
            {
              "regexp": {
                "message": {
                  "value": "(?!.*charge)(?!.*encode)(?!.*relate).*night.*",
                  "flags_value": 65535
                }
              }
            }
          ]
        }
      },
      "filter": {
        "match": {
          "resNb": {
            "query": "462031152161",
            "type": "boolean"
          }
        }
      }
    }
  }

Answer 1:

Solution

You can solve the issue with either of the two:

"value": "~(charge|encode|relate)night~(charge|encode|relate)",

or

.*night.*&~(.*(charge|encode|relate).*)

Optionally, you can also set the flags explicitly (not required, since ALL is the default):

"flags" : "ALL"
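
For illustration, here is a minimal sketch of how the second pattern could be placed in a regexp clause, built as a Python dict and printed as JSON. The field name message comes from the question; the variable name regexp_clause is only a placeholder, and the rest of the query (bool, filter) is left out.

```python
import json

# Field name "message" is taken from the question; adjust it to your mapping.
regexp_clause = {
    "regexp": {
        "message": {
            # Intersection (&) of "contains night" and the complement (~)
            # of "contains charge, encode or relate".
            "value": ".*night.*&~(.*(charge|encode|relate).*)",
            # Optional: ALL is the default and enables the & and ~ operators.
            "flags": "ALL",
        }
    }
}

print(json.dumps(regexp_clause, indent=2))
```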

How does it work?

In common NFA regular expressions, you usually have negative lookarounds that help restrict a more generic pattern (those that look like (?!...) or (?<!...)). However, Elasticsearch does not support lookarounds, so you need to use its own optional operators instead.

The ~ (tilde) is the complement operator: it negates the atom right after it. An atom is either a single symbol or a set of subpatterns/alternatives inside a group.

NOTE that all ES patterns are anchored at the start and end of the string by default, so you never need the ^ and $ anchors that are common in Perl-like, .NET, and other NFA regex flavors.
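
The effect of the implicit anchoring can be approximated in Python, assuming that an ES pattern has to match the whole value: re.fullmatch plays the role of the anchored ES match, while re.search behaves like an unanchored Perl-style search. The shortened sample text and the variable names are just for this sketch.

```python
import re

text = "credited back on night stay"

# Unanchored search, as in Perl-like flavors: finds "night" anywhere.
unanchored = re.search(r"night", text) is not None

# Whole-string match, similar to the implicit anchoring in ES:
# a bare "night" fails, so the pattern needs .* on both sides.
anchored_bare = re.fullmatch(r"night", text) is not None
anchored_padded = re.fullmatch(r".*night.*", text) is not None

print(unanchored, anchored_bare, anchored_padded)
```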

Thus,

  • ~(charge|encode|relate) - matches any text from the start of the string, as long as that text is not exactly charge, encode, or relate
  • night - matches the literal text night
  • ~(charge|encode|relate) - matches any text up to the end of the string, again as long as it is not exactly one of the 3 strings.
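
Python's re module has no ~ operator, so the following is only a rough emulation, assuming that ~X accepts every string that X does not match in full. The helper names are hypothetical. It splits the text at each occurrence of night and applies the complement to the part before and the part after.

```python
import re

BANNED = re.compile(r"charge|encode|relate")

def complement_matches(text: str) -> bool:
    """Emulate ~(charge|encode|relate): true unless the WHOLE text is one of the words."""
    return BANNED.fullmatch(text) is None

def first_pattern_matches(text: str) -> bool:
    """Emulate ~(...)night~(...) with implicit anchoring at both ends."""
    word = "night"
    for i in range(len(text)):
        if text.startswith(word, i):
            prefix, suffix = text[:i], text[i + len(word):]
            if complement_matches(prefix) and complement_matches(suffix):
                return True
    return False

print(complement_matches("charge"))   # the exact word is rejected
print(complement_matches("charged"))  # a longer string is accepted
print(first_pattern_matches("no charge on night stay"))
```

Under that assumption, a string that merely contains one of the words around night is still accepted, which is worth keeping in mind when choosing between the two patterns.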

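The complement excludes only those exact strings, so a prefix such as "no charge on " still passes. To forbid the 3 words anywhere before or after night in an NFA regex like Perl, you could use a tempered greedy token: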

/^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$/
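
As a quick check, the tempered greedy token pattern can be run in Python, whose re module supports lookaheads. This is a sketch only: the pattern is the one above without the /.../ delimiters, sample is the text from the question, and tempered_matches is a made-up helper name.

```python
import re

# The Perl-style pattern from above, minus the /.../ delimiters.
TEMPERED = re.compile(
    r"^(?:(?!charge|encode|relate).)*night(?:(?!charge|encode|relate).)*$"
)

sample = (
    "credited back on night stay, still having issues with construction. "
    "causing health issues due to a chemical being sprayed and causes eyes "
    "to irritated."
)

def tempered_matches(text: str) -> bool:
    """True if the text contains night and none of the three words."""
    return TEMPERED.search(text) is not None

print(tempered_matches(sample))
print(tempered_matches("no charge on night stay"))
```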

The second pattern is trickier. Common NFA regexes usually do not jump from location to location when matching, which is why lookaheads anchored at the start of the text are commonly used there. Here, with the INTERSECTION operator, we can simply combine 2 patterns that both have to match the whole string.

  • .*night.* - match the whole line (as . matches any symbol but a newline, else, use (.|\n)*) with night in it
  • & - and
  • ~(.*(charge|encode|relate).*) - a line that contains none of the substrings charge, encode, or relate.

An NFA Perl-like regex would look like

/^(?!.*(charge|encode|relate)).*night.*$/
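
The intersection can be sketched in Python as two separate whole-string checks, one for each side of the &, and compared against the lookahead version. This assumes that & means "both patterns match the whole string" and that ~ means "the pattern does not match the whole string"; the function names are placeholders.

```python
import re

HAS_NIGHT = re.compile(r".*night.*")
HAS_BANNED = re.compile(r".*(charge|encode|relate).*")
LOOKAHEAD = re.compile(r"^(?!.*(charge|encode|relate)).*night.*$")

def intersection_matches(text: str) -> bool:
    """Emulate .*night.*&~(.*(charge|encode|relate).*) on a single line."""
    return (
        HAS_NIGHT.fullmatch(text) is not None
        and HAS_BANNED.fullmatch(text) is None
    )

def lookahead_matches(text: str) -> bool:
    """The Perl-like equivalent with a negative lookahead at the start."""
    return LOOKAHEAD.match(text) is not None

for line in [
    "credited back on night stay",
    "no charge on night stay",
    "night stay, related issue",
    "no stay at all",
]:
    print(line, intersection_matches(line), lookahead_matches(line))
```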


Answer 2:

You didn't use an anchor for your lookaheads. Try using "^" at the beginning of the pattern and it should work.