ElasticSearch - return the complete value of a fac

I've recently started using ElasticSearch and am working through some use cases. I'm stuck on one of them.

I have indexed some users with their full name (e.g. "Jean-Paul Gautier", "Jean De La Fontaine").

I'm trying to get all the full names that match a query.

For example, I want the 100 most frequent full names beginning with "J":

{
  "query": {
    "query_string": { "query": "full_name:J*" }
  },
  "facets":{
    "name":{
      "terms":{
        "field": "full_name",
        "size":100
      }
    }
  }
}

The result I get is all the individual words of the full names: "Jean", "Paul", "Gautier", "De", "La", "Fontaine".

How can I get "Jean-Paul Gautier" and "Jean De La Fontaine" (all the full_name values beginning with "J")? The "post_filter" option does not do this; it only restricts the subset above.

Do I have to configure how this full_name facet works?
Do I have to add some options to the current query?
Do I have to do some "mapping" (still very obscure to me)?

Thanks

Tags: lucene elasticsearch

2 answers

混吃等死

#2 · 2019-04-06 20:50

Try altering the mapping for "full_name":

"properties": {
  "full_name": {
     "type": "string",
     "index": "not_analyzed"
  }
  ...
}

not_analyzed means that the value is kept as is (capitals, spaces, dashes, etc.), so "Jean De La Fontaine" stays findable as a whole and is not tokenized into "Jean", "De", "La", "Fontaine".

You can experiment with different analyzers using the _analyze API.

Notice what the standard one does to a multi-part name:

GET /_analyze?analyzer=standard
{'Jean Claude Van Dame'}


{
   "tokens": [
      {
         "token": "jean",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "claude",
         "start_offset": 7,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "van",
         "start_offset": 14,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "dame",
         "start_offset": 18,
         "end_offset": 22,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
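
To make the output above easier to read, here is a minimal Python sketch that mimics it without a running cluster. The function name rough_standard_analyze is hypothetical, and the regex is only a rough stand-in for the real standard analyzer, which uses more elaborate word-boundary rules. It does show why the first start_offset is 2: the request body includes the leading {' characters.

```python
import re

def rough_standard_analyze(text):
    """Rough approximation of the standard analyzer (an assumption, not the
    real implementation): split on non-word characters and lowercase."""
    tokens = []
    # Positions are numbered from 1 to match the response shown above.
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

# Same raw body as in the request above, including the {' prefix.
tokens = rough_standard_analyze("{'Jean Claude Van Dame'}")
for t in tokens:
    print(t)
```

Passing a hyphenated value such as "Jean-Paul Gautier" to the same function yields three separate tokens, which is what the facet in the question ends up counting.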

地球回转人心会变

#3 · 2019-04-06 21:08

You just need to set "index": "not_analyzed" on the field, and you will be able to get back the full, unmodified field values in your facet.

Typically, it's nice to have one version of the field that isn't analyzed (for faceting) and another that is (for searching). The "multi_field" field type is useful for this.

So in this case, I can define a mapping as follows:

curl -XPUT "http://localhost:9200/test_index/" -d'
{
   "mappings": {
      "people": {
         "properties": {
            "full_name": {
               "type": "multi_field",
               "fields": {
                  "untouched": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "full_name": {
                     "type": "string"
                  }
               }
            }
         }
      }
   }
}'

Here we have two sub-fields. The one with the same name as the parent will be the default, so if you search against the "full_name" field, Elasticsearch will actually use "full_name.full_name". "full_name.untouched" will give you the facet results you want.
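
To tie this back to the original question (the 100 most frequent full names beginning with "J"), the request body could be sketched in Python as below. This is only one possible approach and rests on assumptions: the helper name build_full_name_facet_request is hypothetical, it relies on the multi_field mapping above, and it swaps the question's query_string for a prefix query on the un-analyzed sub-field. A prefix query is not analyzed, so the match is case-sensitive and applies to the start of the whole name rather than to any word in it.

```python
import json

def build_full_name_facet_request(prefix, size=100):
    """Build a search body (hypothetical helper) that restricts documents to
    full names starting with `prefix` and facets on the whole, unmodified name."""
    return {
        "size": 0,  # only the facet counts are needed, not the hits
        "query": {
            # Assumption: prefix query against the not_analyzed sub-field.
            "prefix": {"full_name.untouched": prefix}
        },
        "facets": {
            "name": {
                "terms": {"field": "full_name.untouched", "size": size}
            }
        },
    }

body = build_full_name_facet_request("J")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a POST to the index's _search endpoint, in the same way as the curl requests in this answer.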

So next I add two documents:

curl -XPUT "http://localhost:9200/test_index/people/1" -d'
{
   "full_name": "Jean-Paul Gautier"
}'

curl -XPUT "http://localhost:9200/test_index/people/2" -d'
{
   "full_name": "Jean De La Fontaine"
}'

And then I can facet on each field to see what is returned:

curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
   "size": 0,
   "facets": {
      "name_terms": {
         "terms": {
            "field": "full_name"
         }
      },
      "name_untouched": {
         "terms": {
            "field": "full_name.untouched",
            "size": 100
         }
      }
   }
}'

and I get back the following:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "facets": {
      "name_terms": {
         "_type": "terms",
         "missing": 0,
         "total": 7,
         "other": 0,
         "terms": [
            {
               "term": "jean",
               "count": 2
            },
            {
               "term": "paul",
               "count": 1
            },
            {
               "term": "la",
               "count": 1
            },
            {
               "term": "gautier",
               "count": 1
            },
            {
               "term": "fontaine",
               "count": 1
            },
            {
               "term": "de",
               "count": 1
            }
         ]
      },
      "name_untouched": {
         "_type": "terms",
         "missing": 0,
         "total": 2,
         "other": 0,
         "terms": [
            {
               "term": "Jean-Paul Gautier",
               "count": 1
            },
            {
               "term": "Jean De La Fontaine",
               "count": 1
            }
         ]
      }
   }
}

As you can see, the analyzed field returns single-word, lower-cased tokens (when you don't specify an analyzer, the standard analyzer is used), and the un-analyzed sub-field returns the unmodified original text.
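
The counts above can be reproduced offline with a small Python sketch. It is an illustration only: the function names are hypothetical, and the regex tokenizer is a rough stand-in for the standard analyzer, not Elasticsearch's actual implementation. A terms facet counts, for each term, how many documents contain it, so each document contributes a set of terms.

```python
import re
from collections import Counter

# The two documents indexed above.
docs = ["Jean-Paul Gautier", "Jean De La Fontaine"]

def analyzed_terms(value):
    """Approximate the analyzed field: lowercased word tokens, deduplicated."""
    return {token.lower() for token in re.findall(r"\w+", value)}

def not_analyzed_terms(value):
    """The not_analyzed sub-field indexes the whole value as a single term."""
    return {value}

def terms_facet(values, to_terms):
    """Count how many documents contain each term."""
    counts = Counter()
    for value in values:
        counts.update(to_terms(value))
    return counts

name_terms = terms_facet(docs, analyzed_terms)
name_untouched = terms_facet(docs, not_analyzed_terms)

print(name_terms.most_common())
print(name_untouched.most_common())
```

Here "jean" is counted twice because it appears in both documents, while each complete name appears once in the un-analyzed counts.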

Here is a runnable example you can play with: http://sense.qbox.io/gist/7abc063e2611846011dd874648fd1b77450b19a5
