ElasticSearch returning only documents with distinct value

Published 2019-03-12 20:33

Question:

Let's say I have this given data

{
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "GEORGE",
            "favorite_cars" : [ "honda","Hyundae" ]
          }

Whenever I query this data searching for people whose favorite car is toyota, it returns this data:

{

            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }

The result is two records, both with the name ABC. How do I select distinct documents only? The result I want is just this:

{
                "name" : "ABC",
                "favorite_cars" : [ "ferrari","toyota" ]
              }

Here's my query:

{
    "fuzzy_like_this_field" : {
        "favorite_cars" : {
            "like_text" : "toyota",
            "max_query_terms" : 12
        }
    }
}

I am using Elasticsearch 1.0.0 with the Java API client.

Answer 1:

You can eliminate duplicates using aggregations. A terms aggregation groups the results by one field, e.g. name, provides a count of the occurrences of each value of that field, and sorts the results by this count (descending).

{
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}

In addition to the hits, the result will also contain the buckets with the unique values in key and with the count in doc_count:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
      "_score" : 0.19178301,
      "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
    }, {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "IdEbAcI6TM6oCVxCI_3fug",
      "_score" : 0.19178301,
      "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
    } ]
  },
  "aggregations" : {
    "grouped_by_name" : {
      "buckets" : [ {
        "key" : "abc",
        "doc_count" : 2
      } ]
    }
  }
}

Note that using aggregations will be costly because of duplicate elimination and result sorting.
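If you only need the distinct values and not the documents themselves, you can cut part of that cost by suppressing the hits with a top-level size of 0, so only the buckets come back. A sketch of the same request with that change:

```json
{
  "size": 0,
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}
```

The aggregation still runs over all matching documents; only the fetch and return of the duplicate hits is skipped.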



Answer 2:

Elasticsearch doesn't provide any query by which you can get distinct documents based on a field value.

Ideally you should have indexed the same document with the same type and id, since Elasticsearch combines these two to form the unique _uid of a document. A unique id is important not only for detecting duplicate documents but also for updating the same document in case of modification instead of inserting a new one. For more information about indexing documents you can read this.

But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses you get from ES.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;

SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();

// Keep one SearchHit per distinct value of the "name" field;
// later duplicates overwrite earlier ones.
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
Iterator<SearchHit> iterator = hits.iterator();
while (iterator.hasNext()) {
    SearchHit searchHit = iterator.next();
    Map<String, Object> source = searchHit.getSource();
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}

So you will end up with a map containing one unique SearchHit per distinct name.

You can also create an object mapping and use that in place of SearchHit.

I hope this solves your problem. Please forgive me if there are any errors in the code; this is pseudo-ish code meant to show how you can solve your problem.
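The deduplication step above can also be sketched as self-contained code, with plain maps standing in for SearchHit sources (the data and the LinkedHashMap choice, which keeps the first hit per name in order, are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupByName {

    // Keep only the first hit per "name", preserving hit order.
    static List<Map<String, Object>> dedupByName(List<Map<String, Object>> hits) {
        Map<String, Map<String, Object>> distinct = new LinkedHashMap<>();
        for (Map<String, Object> source : hits) {
            Object name = source.get("name");
            if (name != null) {
                distinct.putIfAbsent(name.toString(), source);
            }
        }
        return new ArrayList<>(distinct.values());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> hits = new ArrayList<>();
        hits.add(Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")));
        hits.add(Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")));

        List<Map<String, Object>> unique = dedupByName(hits);
        System.out.println(unique.size()); // one entry per distinct name
    }
}
```

Note the trade-off: unlike a terms aggregation, this dedupes only the page of hits the client fetched, so duplicates split across pages survive.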

Thanks



Answer 3:

@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your documents, ordered by occurrence.

{
    "query": { "match_all": { } },
    "size": 0,
    "aggs": {
        "Cars": {
            "terms": { "field": "favorite_cars", "order": { "_count": "desc" }, "size": 10000 }
        }
    }
}

It is also worth noting that you are going to want your "favorite_cars" field to not be analyzed, so that buckets come back as "McLaren F1" as a single term instead of the separate tokens "mclaren" and "f1".

"favorite_cars": {
    "type": "string",
    "index": "not_analyzed"
}


Answer 4:

For a single shard this can be handled using a custom filter, which also takes care of pagination. To handle the above use case we can use script support as follows:

  • Define a custom script filter. For this discussion assume it is called AcceptDistinctDocumentScriptFilter
  • This custom filter takes in a list of primary keys as input.
  • These primary keys are the fields whose values will be used to determine uniqueness of records.
  • Now, instead of using aggregation we use normal search request and pass the custom script filter to the request.
  • If the search already has a filter/query criteria defined, then append the custom filter using a logical AND operator.
  • Following is an example in pseudo syntax. If the request is: select * from myindex where file_hash = 'hash_value', then append the custom filter as:
    select * from myindex where file_hash = 'hash_value' AND AcceptDistinctDocumentScriptFilter(params= ['file_name', 'file_folder'])
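The per-shard accept/reject logic such a filter would apply can be sketched in plain Java. Everything here is illustrative: the class name mirrors the hypothetical AcceptDistinctDocumentScriptFilter, and the composite key is built from the example file_name/file_folder params:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctDocumentFilter {
    private final List<String> keyFields;      // fields that define uniqueness
    private final Set<String> seen = new HashSet<>();

    DistinctDocumentFilter(List<String> keyFields) {
        this.keyFields = keyFields;
    }

    // Accept a document only the first time its composite key is seen.
    boolean accept(Map<String, Object> doc) {
        StringBuilder key = new StringBuilder();
        for (String field : keyFields) {
            key.append(doc.get(field)).append('\u0000'); // separator avoids collisions
        }
        return seen.add(key.toString()); // Set.add returns false for duplicates
    }

    public static void main(String[] args) {
        DistinctDocumentFilter filter =
                new DistinctDocumentFilter(List.of("file_name", "file_folder"));

        System.out.println(filter.accept(Map.of("file_name", "a.txt", "file_folder", "/tmp")));
        System.out.println(filter.accept(Map.of("file_name", "a.txt", "file_folder", "/tmp")));
        System.out.println(filter.accept(Map.of("file_name", "b.txt", "file_folder", "/tmp")));
    }
}
```

The stateful seen set is exactly why this only works per shard: each shard runs its own filter instance, so cross-shard duplicates are not caught without the plugin hook mentioned below.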

For distributed search this is tricky and needs a plugin to hook into the QUERY phase. More details here.