I've recently started using Elasticsearch and I am working through some use cases. I have a problem with one of them.
I have indexed some users with their full name (e.g. "Jean-Paul Gautier", "Jean De La Fontaine").
I am trying to get all the full names that match a given query.
For example, I want the 100 most frequent full names beginning with "J":
{
  "query": {
    "query_string": { "query": "full_name:J*" }
  },
  "facets": {
    "name": {
      "terms": {
        "field": "full_name",
        "size": 100
      }
    }
  }
}
The result I get is all the individual words of the full names: "Jean", "Paul", "Gautier", "De", "La", "Fontaine".
How can I get "Jean-Paul Gautier" and "Jean De La Fontaine" (all the full_name values beginning with "J")? The "post_filter" option does not do this; it only restricts the subset above.
Do I have to:
- configure how this full_name facet works?
- add some options to the current query?
- do some "mapping" (very obscure to me for the moment)?
Thanks
Try altering the mapping for "full_name" so that the field is indexed as "not_analyzed".
"not_analyzed" means that the value will be kept as is (capitals, spaces, dashes, etc.), so that "Jean De La Fontaine" will stay findable and not be tokenized into "Jean", "De", "La", "Fontaine".
You can experiment with different analyzers using the analyze API. Notice what the standard one does to a multi-part name: it splits it into separate, lower-cased words.
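As a rough sketch of the mapping change suggested above, assuming the legacy (pre-2.0) "string" field type that matches the "facets" syntax in the question, the body of an index-creation request could look like this. The index name "test_index" (as in PUT /test_index) and the type name "doc" are placeholders, not names from the original post:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "full_name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
```

Note that the index setting of an existing field cannot be changed in place, so this would normally mean creating a new index with this mapping and reindexing the users.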
You just need to set "index": "not_analyzed" on the field, and you will be able to get back the full, unmodified field values in your facet.
Typically, it's nice to have one version of the field that isn't analyzed (for faceting) and another that is (for searching). The "multi_field" field type is useful for this.
So in this case, I can define a mapping as follows:
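A minimal sketch of such a "multi_field" mapping, in the legacy pre-2.0 syntax, sent as the body of an index-creation request. The index name "test_index" (PUT /test_index) and the type name "doc" are assumptions made for illustration; the sub-field names "full_name" and "untouched" are the ones discussed in this answer:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "full_name": {
          "type": "multi_field",
          "fields": {
            "full_name": {
              "type": "string"
            },
            "untouched": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
```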
Here we have two sub-fields. The one with the same name as the parent will be the default, so if you search against the "full_name" field, Elasticsearch will actually use "full_name.full_name". "full_name.untouched" will give you the facet results you want.
So next I add two documents:
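For illustration, a document body of roughly this shape could be indexed with PUT /test_index/doc/1, where the index name, type name and id are placeholders. The name is taken from the question and is assumed here, since any full name would do:

```json
{
  "full_name": "Jean-Paul Gautier"
}
```

The second document would have the same shape, for example with "Jean De La Fontaine" under a different id.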
And then I can facet on each field to see what is returned:
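A sketch of a search body that facets on both sub-fields at once, assuming it is posted to /test_index/_search on the placeholder index. The facet names "name_terms" and "name_untouched_terms" are arbitrary labels chosen for this example:

```json
{
  "size": 0,
  "facets": {
    "name_terms": {
      "terms": {
        "field": "full_name",
        "size": 100
      }
    },
    "name_untouched_terms": {
      "terms": {
        "field": "full_name.untouched",
        "size": 100
      }
    }
  }
}
```

Setting "size" to 0 at the top level suppresses the hits, so only the facet counts come back.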
In the response, the analyzed field returns single-word, lower-cased tokens (when you don't specify an analyzer, the standard analyzer is used), and the un-analyzed sub-field returns the unmodified original text.
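To come back to the original goal (the 100 most frequent full names beginning with "J"), one possible approach is to run both the query and the facet against the un-analyzed sub-field. This sketch uses a "prefix" query instead of the "query_string" from the question, because a prefix query is not analyzed and so is matched against the stored value exactly as indexed. The "untouched" sub-field is assumed to be mapped as described above:

```json
{
  "size": 0,
  "query": {
    "prefix": {
      "full_name.untouched": "J"
    }
  },
  "facets": {
    "name": {
      "terms": {
        "field": "full_name.untouched",
        "size": 100
      }
    }
  }
}
```

Because the sub-field is not analyzed, the prefix is case-sensitive: "J" would match "Jean-Paul Gautier" but "j" would not.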
Here is a runnable example you can play with: http://sense.qbox.io/gist/7abc063e2611846011dd874648fd1b77450b19a5