Elasticsearch average over date histogram buckets

I've got a bunch of documents indexed in ElasticSearch, and I need to get the following data:

For each month, get the average number of documents per working day of the month (or if impossible, use 20 days as the default).

I already aggregated my data into months buckets using the date histogram aggregation. I tried to nest a stats bucket, but this aggregations uses data extracted from the document's field, not from the parent bucket.

Here is my query so far:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "docs_per_month": {
            "date_histogram": {
                "field": "created_date",
                "interval": "month",
                "min_doc_count": 0
            }
            "aggs": {
                '???': '???'
            }
        }
    }
}

edit

To make my question clearer, what I need is:

Get the total of numbers of documents created for the month (which is already done thanks to the date_histogram aggregation)
Get the number of working days for the month
Divide the first by the second.

标签： elasticsearch facets date-histogram

4条回答

倾城　Initia

2楼-- · 2019-05-03 11:32

What you basically need is something like this (which doesn't work, as it's not an available feature):

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "average": {
          "avg": {
            "script": "doc_count / 20"
          }
        }
      }
    }
  }
}

It doesn't work because there is not way of accessing the doc_count from the "parent" aggregation.

But, this will be possible in the 2.x branch of Elasticsearch and, at the moment, it's being actively developed: https://github.com/elastic/elasticsearch/issues/8110 This new feature will add a second layer of manipulation over the results (buckets) of an aggregation and it's not only your usecase, but many others.

Unless you want to try some ideas out there or perform your own calculations in your app, you need to wait for this feature.

0人赞添加讨论(0) 举报

劳资没心，怎么记你

3楼-- · 2019-05-03 11:33

For anyone still interested, you can now do with with the avg_bucket aggregation. Its still a bit tricky, because you cannot simply run the avg_bucket on a date_historgram aggregation result, but with a secondary value_count aggregation with some unique value and it works fine :)

{
  "size": 0,
  "aggs": {
    "orders_per_day": {
      "date_histogram": {
        "field": "orderedDate",
        "interval": "day"
      },
      "aggs": {
        "amount": {
          "value_count": {
            "field": "dateCreated"
          }
        }
      }
    },
    "avg_daily_order": {
      "avg_bucket": {
        "buckets_path": "orders_per_day>amount"
      }
    }
  }
}

0人赞添加讨论(0) 举报

唯我独甜

4楼-- · 2019-05-03 11:38

You want to exclude documents with timestamp on Saturday and Sunday, so you can exclude those documents in your query using a script

{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc['@timestamp'].date.dayOfWeek != 7 && doc['@timestamp'].date.dayOfWeek != 6"
        }
      }
    }
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "created_date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "docs_per_day": {
          "date_histogram": {
            "field": "created_date",
            "interval": "day",
            "min_doc_count": 0
          }
        },
        "aggs": {
          "docs_count": {
            "avg": {
              "field": ""
            }
          }
        }
      }
    }
  }
}

You may not need the first aggregation by month, since you already have this information using day interval

BTW you need to make sure dynamic scripting is enabled by adding this to your elasticsearch.yml configuration

script.disable_dynamic: false

Or add a groovy script under /config/scripts and use a filtered query with a script in filter

0人赞添加讨论(0) 举报

家丑人穷心不美

5楼-- · 2019-05-03 11:44

There is a pretty convoluted solution and not really performant, using the following scripted_metric aggregation.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "created_date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_doc_per_biz_day": {
          "scripted_metric": {
            "init_script": "_agg.bizdays = []; _agg.allbizdays = [:]; start = new DateTime(1970, 1, 1, 0, 0); now = new DateTime(); while (start < now) { def end = start.plusMonths(1); _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; start = end; }",
            "map_script": "_agg.bizdays << _agg.allbizdays[doc. created_date.date.year+'_'+doc. created_date.date.monthOfYear]",
            "combine_script": "_agg.allbizdays = null; doc_count = 0; for (d in _agg.bizdays){ doc_count++ }; return doc_count / _agg.bizdays[0]",
            "reduce_script": "res = 0; for (a in _aggs) { res += a }; return res"
          }
        }
      }
    }
  }
}

Let's detail each script below.

What I'm doing in init_script is creating a map of the number of business days for each month since 1970 and storing that in the _agg.allbizdays map.

_agg.bizdays = [];
_agg.allbizdays = [:]; 
start = new DateTime(1970, 1, 1, 0, 0);
now = new DateTime();
while (start < now) { 
    def end = start.plusMonths(1);     
    _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; 
    start = end; 
}

In map_script, I'm simply retrieving the number of weekdays for the month of each document;

_agg.bizdays << _agg.allbizdays[doc.created_date.date.year + '_' + doc. created_date.date.monthOfYear];

In combine_script, I'm summing up the average doc count for each shard

_agg.allbizdays = null;
doc_count = 0; 
for (d in _agg.bizdays){ doc_count++ }; 
return doc_count / _agg.bizdays[0];

And finally in reduce_script, I'm summing up the average doc count for each node:

res = 0; 
for (a in _aggs) { res += a }; 
return res

Again I think it's pretty convoluted and as Andrei rightly said it, it is probably better to wait for 2.0 to make it work the way it should, but in the meantime you have this solution, if you need it.

0人赞添加讨论(0) 举报

Elasticsearch average over date histogram buckets

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间