Question:
Let's say I have this given data
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
Whenever I query this data searching for people whose favorite car is toyota, it returns this data:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
The result is two records with the name ABC. How do I select distinct documents only? The result I want to get is only this:
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}
Here's my Query
{
"fuzzy_like_this_field" : {
"favorite_cars" : {
"like_text" : "toyota",
"max_query_terms" : 12
}
}
}
I am using Elasticsearch 1.0.0 with the Java API client.
Answer 1:
You can eliminate duplicates using aggregations. With the terms aggregation the results are grouped by one field, e.g. name, and each bucket also carries a count of the occurrences of that field value; the buckets are sorted by this count (descending).
{
"query": {
"fuzzy_like_this_field": {
"favorite_cars": {
"like_text": "toyota",
"max_query_terms": 12
}
}
},
"aggs": {
"grouped_by_name": {
"terms": {
"field": "name",
"size": 0
}
}
}
}
In addition to the hits, the result will also contain the buckets with the unique values in key and the count in doc_count:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.19178301,
"hits" : [ {
"_index" : "pru",
"_type" : "pru",
"_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
}, {
"_index" : "pru",
"_type" : "pru",
"_id" : "IdEbAcI6TM6oCVxCI_3fug",
"_score" : 0.19178301,
"_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
} ]
},
"aggregations" : {
"grouped_by_name" : {
"buckets" : [ {
"key" : "abc",
"doc_count" : 2
} ]
}
}
}
Note that using aggregations will be costly because of duplicate elimination and result sorting.
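Since the question mentions the Java API client, here is a minimal sketch of how the same request might be built with the 1.0 Java API; the index name pru is taken from the sample response above, and client is assumed to be an already-connected Client instance.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class DistinctNames {
    // Builds the fuzzy_like_this_field query plus the terms aggregation shown above
    // and prints each distinct "name" with the number of hits that share it.
    public static void printDistinctNames(Client client) {
        SearchResponse response = client.prepareSearch("pru")
                .setQuery(QueryBuilders.fuzzyLikeThisFieldQuery("favorite_cars")
                        .likeText("toyota")
                        .maxQueryTerms(12))
                .addAggregation(AggregationBuilders.terms("grouped_by_name")
                        .field("name")
                        .size(0))   // size 0 = return all buckets
                .execute()
                .actionGet();

        Terms groupedByName = response.getAggregations().get("grouped_by_name");
        for (Terms.Bucket bucket : groupedByName.getBuckets()) {
            System.out.println(bucket.getKey() + " -> " + bucket.getDocCount());
        }
    }
}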
Answer 2:
Elasticsearch doesn't provide any query by which you can get distinct documents based on a field value.
Ideally you should have indexed the same document with the same type and id, since these two are what Elasticsearch uses to build a document's unique _uid. The unique id is important not only for detecting duplicate documents but also for updating the same document in case of any modification instead of inserting a new one. For more information about indexing documents you can read this.
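As a rough illustration of that point, here is a minimal sketch of indexing with an explicit id via the Java API; the index and type name pru are taken from the response in the first answer, and using the person's name as the id is only an assumption for this example.

import org.elasticsearch.client.Client;

public class IndexWithExplicitId {
    // Indexing "ABC" a second time with the same index, type and id updates the
    // existing document instead of creating a duplicate. Deriving the id from the
    // name is just an assumption for illustration.
    public static void index(Client client) {
        client.prepareIndex("pru", "pru", "ABC")
                .setSource("{\"name\":\"ABC\",\"favorite_cars\":[\"ferrari\",\"toyota\"]}")
                .execute()
                .actionGet();
    }
}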
But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value on your own. In fact, this gives you more flexibility to perform custom operations on the responses that you get from ES.
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

SearchResponse response = client.prepareSearch().execute().actionGet();

// Keep only one hit per distinct value of the "name" field.
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
for (SearchHit searchHit : response.getHits()) {
    Map<String, Object> source = searchHit.getSource();
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}
So you will end up with a map of unique SearchHit objects keyed by name.
You can also create an object mapping and use that in place of SearchHit.
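As a rough sketch of that last point (the Person class below is hypothetical; the field names come from the question's documents), you could map each hit's source onto your own object and de-duplicate those instead:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

// Hypothetical domain object mirroring the documents in the question.
class Person {
    final String name;
    final List<String> favoriteCars;

    @SuppressWarnings("unchecked")
    Person(Map<String, Object> source) {
        this.name = (String) source.get("name");
        this.favoriteCars = (List<String>) source.get("favorite_cars");
    }
}

class DistinctPeople {
    // Collects one Person per distinct name from the search response.
    static Map<String, Person> fromResponse(SearchResponse response) {
        Map<String, Person> distinct = new HashMap<String, Person>();
        for (SearchHit hit : response.getHits()) {
            Person person = new Person(hit.getSource());
            if (person.name != null) {
                distinct.put(person.name, person);
            }
        }
        return distinct;
    }
}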
I hope this solves your problem. Please forgive me if there are any errors in the code; it is just pseudo-ish code meant to show how you can solve your problem.
Thanks
Answer 3:
@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your index, ordered by occurrence.
{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "distinct_cars": {
      "terms": { "field": "favorite_cars", "order": { "_count": "desc" }, "size": 10000 }
    }
  }
}
It is also worth noting that you will want your "favorite_cars" field to be not_analyzed in order to get back "McLaren F1" as a single value instead of the separate terms "McLaren" and "F1".
"favorite_car": {
"type": "string",
"index": "not_analyzed"
}
Answer 4:
For a single shard, this can be handled using a custom filter which also takes care of pagination. To handle the above use case we can use script support as follows:
- Define a custom script filter. For this discussion assume it is called AcceptDistinctDocumentScriptFilter
- This custom filter takes in a list of primary keys as input.
- These primary keys are the fields whose values will be used to determine uniqueness of records.
- Now, instead of using an aggregation we use a normal search request and pass the custom script filter to the request.
- If the search already has filter/query criteria defined, then append the custom filter using a logical AND operator.
- Following is an example using pseudo syntax:
if the request is:
select * from myindex where file_hash = 'hash_value'
then append the custom filter as:
select * from myindex where file_hash = 'hash_value' AND AcceptDistinctDocumentScriptFilter(params= ['file_name', 'file_folder'])
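For what it's worth, here is a hedged sketch of how that pseudo query might be assembled with the 1.x Java API, assuming the AcceptDistinctDocumentScriptFilter has been registered as a native script on the nodes; the index, field, and parameter names are only illustrative.

import java.util.Arrays;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class DistinctByScriptFilter {
    // Combines the file_hash criterion with the custom script filter (logical AND),
    // passing the primary-key fields that define uniqueness as a parameter.
    public static SearchResponse search(Client client) {
        return client.prepareSearch("myindex")
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.termQuery("file_hash", "hash_value"),
                        FilterBuilders.scriptFilter("AcceptDistinctDocumentScriptFilter")
                                .lang("native")
                                .addParam("fields", Arrays.asList("file_name", "file_folder"))))
                .execute()
                .actionGet();
    }
}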
For distributed search this is tricky and needs a plugin to hook into the QUERY phase. More details here.