Word-oriented completion suggester (Elasticsearch)

Posted 2019-01-17 03:42

Elasticsearch 5.x introduced some breaking changes to the Suggester API (see the documentation). The most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

With a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

In short, the completion suggestion for the text "joh" returned two documents, one for each John, and both options carry the same value in the text property.

However, I would like to receive one (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}
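
As a stopgap, duplicates can be collapsed on the client. The sketch below is only an illustration: the helper name unique_suggestions is made up, and it assumes the 5.x response shape shown above. It does not solve the underlying problem, because duplicate options still occupy slots in the suggester's result size before they are collapsed.

```python
def unique_suggestions(response, name="my-suggest"):
    """Collapse options that share the same text, keeping first-seen order.

    `response` is assumed to be the parsed JSON of a 5.x _suggest call.
    """
    seen = set()
    unique = []
    for entry in response.get(name, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique


# Trimmed-down copy of the response from the question.
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

print(unique_suggestions(response))
```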

Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.

Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?


EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

  1. Keeping the new index in sync.
  2. Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
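
The difference between the two behaviours can be shown with a small in-memory sketch. It involves no Elasticsearch at all; the function names and the second document ("David Smith") are assumptions made up for the illustration.

```python
# Each inner list stands for the words of one document.
DOCS = [["John", "Doe"], ["David", "Smith"]]


def complete_global(docs, prefix):
    """Complete against one flat word list, ignoring which document a word came from."""
    prefix = prefix.lower()
    words = [word for doc in docs for word in doc]
    return [word for word in words if word.lower().startswith(prefix)]


def complete_in_context(docs, query):
    """Complete the last word, but only within documents containing the earlier words."""
    tokens = query.lower().split()
    if not tokens:
        return []
    *context, prefix = tokens
    results = []
    for doc in docs:
        lowered = [word.lower() for word in doc]
        if all(word in lowered for word in context):
            for word in doc:
                candidate = word.lower()
                if candidate.startswith(prefix) and candidate not in context:
                    results.append(word)
    return results


print(complete_global(DOCS, "D"))           # word-only index: both words match
print(complete_in_context(DOCS, "John D"))  # narrowed by the first word
```

The narrowed variant needs to know which words belong to which document, which is exactly the information a word-only index throws away.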

3 answers
祖国的老花朵
#2 · 2019-01-17 04:35

We face exactly the same problem. In Elasticsearch 2.4 the approach you describe worked fine for us, but, as you say, the suggester has now become document-based. Like you, we are only interested in unique words, not in the documents.

The only 'solution' we could think of so far is to create a separate index just for the words on which we want to perform the suggestion queries and in this separate index make sure somehow that identical words are only indexed once. Then you could perform the suggestion queries on this separate index. This is far from ideal, if only because we will then need to make sure that this index remains in sync with the other index that we need for our other queries.
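
One way to make sure identical words are only indexed once is to use the lowercased word itself as the document _id, so that indexing the same word again overwrites the existing entry instead of adding a second one. The sketch below only builds the newline-delimited body for a bulk request. The function name, the field list, and the shape of the word documents are assumptions, and it does not address removing a word once the last document containing it is deleted.

```python
import json


def word_bulk_body(documents, fields=("firstName", "lastName")):
    """Build an NDJSON bulk body with one action per unique word.

    The lowercased word is used as _id, so re-indexing a word is idempotent.
    """
    words = {}
    for doc in documents:
        for field in fields:
            for word in str(doc.get(field, "")).split():
                # Keep the first spelling seen for each lowercased key.
                words.setdefault(word.lower(), word)

    lines = []
    for key in sorted(words):
        lines.append(json.dumps({"index": {"_id": key}}))
        lines.append(json.dumps({"suggest": {"input": [words[key]]}}))
    return "\n".join(lines) + "\n" if lines else ""


users = [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "John", "lastName": "Smith"},
]
body = word_bulk_body(users)
print(body)
```

The resulting string could be sent to the _bulk endpoint of a dedicated words index (the index name is left open here).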

Emotional °昔
#3 · 2019-01-17 04:36

As hinted at in the comment, another way of achieving this without getting duplicate documents is to add a dedicated autocomplete field with a sub-field that contains edge n-grams of its value, plus a raw keyword sub-field to aggregate on. First you define your mapping like this:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}
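
To see what ends up in the autocomplete.completion sub-field, the analyzer chain above (keyword tokenizer, then lowercase, then edge_ngram with min_gram 1 and max_gram 24) can be imitated in a few lines. This is a rough emulation written for illustration, not Elasticsearch code, and the function name is made up.

```python
def edge_ngrams(value, min_gram=1, max_gram=24):
    """Imitate keyword tokenizer + lowercase + edge_ngram on a single value.

    The keyword tokenizer keeps the whole value as one token, so the
    n-grams are simply the prefixes of the lowercased string.
    """
    token = value.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


for name in ("John Doe", "John Deere", "Johnny Cash"):
    grams = edge_ngrams(name)
    print(name, "->", grams)
    print("  contains 'john d':", "john d" in grams)
```

Because a term query is not analyzed, the search text has to equal one of these indexed prefixes exactly, including case and the space between the words.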

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

Then you can query for john d and get one result for John Doe and another one for John Deere, while Johnny Cash is not matched:

POST my-index/users/_search
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
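
Reading the suggestions out of this response is a matter of collecting the bucket keys. The helper below is a small sketch with a made-up name. Note that the keys are whole field values ("John Doe"), not single words, and that a terms aggregation returns only the top 10 buckets unless a size is set.

```python
def suggestion_keys(response, agg_name="suggestions"):
    """Return the bucket keys of a terms aggregation as a plain list."""
    aggregation = response.get("aggregations", {}).get(agg_name, {})
    return [bucket["key"] for bucket in aggregation.get("buckets", [])]


# Same structure as the results shown above.
response = {
    "aggregations": {
        "suggestions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "John Doe", "doc_count": 1},
                {"key": "John Deere", "doc_count": 1},
            ],
        }
    }
}

print(suggestion_keys(response))
```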
爷、活的狠高调
#4 · 2019-01-17 04:48

A skip_duplicates option for the completion suggester will be added in the next release, 6.x.

From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:

POST music/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "nor",
            "completion" : {
                "field" : "suggest",
                "skip_duplicates": true
            }
        }
    }
}
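
From application code, the same request body can be assembled as a plain dictionary and the option texts read back from the response. This is a sketch only: both function names are made up, no HTTP client is involved, and the sample response is hand-written to match the _search suggest response layout rather than taken from a real cluster.

```python
import json


def build_suggest_body(prefix, field="suggest", name="song-suggest", skip_duplicates=True):
    """Build the body of a _search request that uses the completion suggester."""
    return {
        "suggest": {
            name: {
                "prefix": prefix,
                "completion": {"field": field, "skip_duplicates": skip_duplicates},
            }
        }
    }


def option_texts(response, name="song-suggest"):
    """Collect the text of every option returned for the named suggester."""
    entries = response.get("suggest", {}).get(name, [])
    return [option["text"] for entry in entries for option in entry.get("options", [])]


body = build_suggest_body("nor")
print(json.dumps(body, indent=4))

# Hand-written sample response, for illustration only.
sample = {
    "suggest": {
        "song-suggest": [
            {"text": "nor", "offset": 0, "length": 3, "options": [{"text": "Nirvana"}]}
        ]
    }
}
print(option_texts(sample))
```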