Limit ElasticSearch aggregation to top n query results

Posted 2019-01-18 05:03

Question:

I have a set of 2.8 million docs with sets of tags that I'm querying with ElasticSearch, but many of these docs can be grouped together by one ID. I want to query my data using the tags, and then aggregate them by the ID that repeats. Often my search results have tens of thousands of documents, but I only want to aggregate the top 100 results of the search. How can I constrain an aggregation to only the top 100 results from a query?

Answer 1:

Sampler aggregation:

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

"aggs": {
    "bestDocs": {
        "sampler": {
            // "field": "<FIELD>",  <-- optional, controls diversity using a field
            "shard_size": 100
        },
        "aggs": {
            "bestBuckets": {
                "terms": {
                    "field": "id"
                }
            }
        }
    }
}

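This aggregation limits the sub-aggregation to the 100 top-scoring documents on each shard and then buckets them by ID. Note that shard_size is a per-shard limit, so an index with several shards can contribute more than 100 documents in total.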
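To show how the sampler fits into a complete search request, here is a minimal Python sketch that only builds the request body. The `tags` field name, the `match` query, and the helper name `build_sampled_search` are assumptions for illustration; only the `id` field and the aggregation names come from the answer above.

```python
import json


def build_sampled_search(tag_text, sample_size=100):
    """Build a search body whose aggregation only sees top-scoring hits."""
    return {
        "size": 0,  # do not return the hits themselves, only the aggregation
        "query": {"match": {"tags": tag_text}},  # "tags" is an assumed field name
        "aggs": {
            "bestDocs": {
                # shard_size caps the number of top-scoring docs taken per shard
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "bestBuckets": {"terms": {"field": "id"}}
                },
            }
        },
    }


body = build_sampled_search("red shoes")
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the body of a request to an index's `_search` endpoint with whichever HTTP or Elasticsearch client you already use.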

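Optionally, you can use the field or script and max_docs_per_value settings to limit how many documents sharing a common value are collected on any one shard. In Elasticsearch 5.0 and later, these diversity settings belong to the separate diversified_sampler aggregation rather than to sampler itself.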



Answer 2:

The size parameter can be set to define how many term buckets should be returned out of the overall terms list.

By default, the node coordinating the search process will request each shard to provide its own top size term buckets and, once all shards respond, it will reduce the results to the final list that is returned to the client. This means that if the number of unique terms is greater than size, the returned list may be inaccurate: term counts can be slightly off, and a term that belongs in the top size buckets may be missing entirely.
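The inaccuracy described above can be reproduced with a small pure-Python simulation. This is a sketch of the reduce step as the paragraph describes it, not Elasticsearch's actual implementation; the shard contents and the function names are made up for illustration, and real clusters may ask each shard for more than `size` buckets via `shard_size`.

```python
from collections import Counter


def approximate_top_terms(shards, size):
    """Merge only each shard's local top `size` terms, then take the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(size):
            merged[term] += count
    return merged.most_common(size)


def exact_top_terms(shards, size):
    """Merge every term count from every shard before taking the top `size`."""
    merged = Counter()
    for shard in shards:
        merged.update(shard)  # Counter.update adds counts from a mapping
    return merged.most_common(size)


# Hypothetical per-shard term counts: "c" is third on both shards
# but is the most frequent term overall.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 11, "e": 9, "c": 8},
]

print(approximate_top_terms(shards, 2))  # [('d', 11), ('a', 10)]
print(exact_top_terms(shards, 2))        # [('c', 16), ('d', 11)]
```

In this toy data the term `c` never makes any shard's local top 2, so the merged result drops it even though it has the highest total count.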

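If set to 0, the size will be set to Integer.MAX_VALUE. This applies to older Elasticsearch releases only; support for a size of 0 in the terms aggregation was removed in 5.0.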

Here is an example that returns the top 100 term buckets:

{
    "aggs": {
        "products": {
            "terms": {
                "field": "product",
                "size": 100
            }
        }
    }
}

See the Elasticsearch terms aggregation documentation for more information.



Answer 3:

You can use the min_doc_count parameter, which returns only the term buckets that contain at least the given number of documents:

{
    "aggs": {
        "products": {
            "terms": {
                "field": "product",
                "min_doc_count": 100
            }
        }
    }
}
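To make the effect of min_doc_count concrete, here is a rough Python sketch of the filtering it performs on term buckets. The data and the helper name `terms_buckets` are hypothetical, and the sketch ignores the terms aggregation's own size limit and all shard-level behaviour.

```python
from collections import Counter


def terms_buckets(values, min_doc_count=1):
    """Bucket values by term and keep buckets with at least min_doc_count docs."""
    counts = Counter(values)
    return [(term, n) for term, n in counts.most_common() if n >= min_doc_count]


# Hypothetical "product" field values across matching documents.
products = ["x"] * 150 + ["y"] * 100 + ["z"] * 99

buckets = terms_buckets(products, min_doc_count=100)
print(buckets)  # [('x', 150), ('y', 100)]
```

As the sketch suggests, the parameter filters buckets by their document count; it does not restrict which documents from the query feed into the aggregation.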