Get the latest document version and aggregate the

Posted 2019-09-13 01:49

Question:

My index contains a lot of documents, each of them has several versions, for example:

{"doc_id": 13,
"version": 1,
"text": "bar"}

{"doc_id": 13,
"version": 2,
"text": "bar"}

{"doc_id": 13,
"version": 3,
"text": "bar"}

{"doc_id": 14,
"version": 1,
"text": "foo"}

{"doc_id": 14,
"version": 2,
"text": "bar"}

I want to get the latest version of each document and then aggregate those latest versions using a terms aggregation.
I've tried using top hits to retrieve the latest versions:

{
    "size" : 0,
    "aggs" : {
        "doc_id_groups" : {
            "terms" : {
                "field" : "doc_id",
                "size" : "0"
            },
            "aggs" : {
                "docs" : {
                    "top_hits" : {
                        "size" : 1,
                        "sort" : {
                            "version" : {
                                "order" : "desc"
                            }
                        }
                    }
                }
            }
        }
    }
}

But I can't aggregate on the result, because top hits doesn't support sub-aggregations.
I guess retrieving the ids and then aggregating them would be a very heavy operation for the client.
Maybe scripting could help?

Update: one thing I forgot to mention: the documents are filtered by time range before aggregating, so we don't know which version is the latest at index time, only at search time.

Answer 1:

From the provided samples and the additional details in chat, I do not think you can achieve the required result using aggregations. But I can propose an alternative solution instead:

  1. Add a Boolean property "current" that is set to true for the latest version of each document. When a new version is inserted, "current" is set to false on the older version and to true on the newer one.
  2. Add a property "timepoints" that holds multiple values. At the end of the day (any other period can be used), append the current timestamp (or any other id of the period, e.g. "09.30.2016" or "Jan") to the "timepoints" array of every current record.
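
The two steps above can be sketched in plain Python, with an in-memory list standing in for the index. The function names `insert_version` and `stamp_current` are hypothetical, and the sketch assumes versions arrive in increasing order. A real setup would issue the equivalent update requests against Elasticsearch instead.

```python
from typing import Dict, List


def insert_version(index: List[Dict], doc: Dict) -> None:
    """Step 1: add a new version and move the "current" flag to it."""
    for existing in index:
        if existing["doc_id"] == doc["doc_id"] and existing["current"]:
            existing["current"] = False  # the older version is no longer the latest
    index.append({**doc, "current": True, "timepoints": []})


def stamp_current(index: List[Dict], timepoint: str) -> None:
    """Step 2: append a period id to every record that is current right now."""
    for existing in index:
        if existing["current"]:
            existing["timepoints"].append(timepoint)


# In-memory stand-in for the index (assumption: not a real Elasticsearch index).
index: List[Dict] = []
insert_version(index, {"doc_id": 13, "version": 1, "text": "bar"})
insert_version(index, {"doc_id": 14, "version": 1, "text": "foo"})
stamp_current(index, "2016-09-29")
insert_version(index, {"doc_id": 14, "version": 2, "text": "bar"})
stamp_current(index, "2016-09-30")

for record in index:
    print(record)
```

After the second stamp, version 1 of document 14 carries only the first time point, while version 2 carries only the second, so each time point identifies the versions that were current at that moment.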

Pros:

  • You can easily retrieve the records that were current at some point in time just by checking whether that time point is in the "timepoints" array.

  • You can retrieve all the available time points from all the documents with a single query.

  • You can do the aggregation by time points, e.g. to count all the records at every point of time.

  • No need to maintain multiple indices, duplicates of the records etc., the algorithm is pretty straightforward.
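
As a rough illustration of the first and third points, the search bodies could look like the Python dicts below. This is only a sketch: the helper names are hypothetical, and it assumes "timepoints" and "text" are indexed as exact (non-analyzed) values so that term queries and terms aggregations work on them.

```python
import json
from typing import Dict


def timepoint_terms_query(timepoint: str, field: str = "text") -> Dict:
    """Select the records that were current at `timepoint`, then bucket them by `field`."""
    return {
        "size": 0,
        "query": {"term": {"timepoints": timepoint}},
        "aggs": {"by_value": {"terms": {"field": field}}},
    }


def count_per_timepoint_query() -> Dict:
    """Count the current records at every stored time point."""
    return {
        "size": 0,
        "aggs": {"per_timepoint": {"terms": {"field": "timepoints"}}},
    }


body = timepoint_terms_query("2016-09-30")
print(json.dumps(body, indent=2))
print(json.dumps(count_per_timepoint_query(), indent=2))
```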

Cons:

  • You cannot get the current versions at an arbitrary point in time, only at the points when the calculation was performed.

  • The overall size of the "timepoints" arrays may increase significantly if you run the calculation too often and you have millions of records.

Workarounds:

  • For more fine-grained statistics, run the calculation on an hourly basis, but once a day (or month, or year) remove some of the time points from the "timepoints" array for older periods. In the end you will have a set of time points that corresponds to every year (for anything more than a year ago), every month (more than a month ago), every day (more than a day ago), and every hour for the latest period. Of course, the algorithm for removing time points can be improved according to your needs.
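
One possible reading of that thinning rule, sketched in Python: each time point falls into a bucket that gets coarser with age, and only the latest time point per bucket is kept. The thresholds (365 days, 30 days, 1 day) and the names `bucket_key` and `thin_timepoints` are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def bucket_key(tp: datetime, now: datetime) -> Tuple[int, ...]:
    """Older time points fall into coarser buckets: year, month, day, then hour."""
    age = now - tp
    if age > timedelta(days=365):
        return (tp.year,)
    if age > timedelta(days=30):
        return (tp.year, tp.month)
    if age > timedelta(days=1):
        return (tp.year, tp.month, tp.day)
    return (tp.year, tp.month, tp.day, tp.hour)


def thin_timepoints(timepoints: List[datetime], now: datetime) -> List[datetime]:
    """Keep only the latest time point in each bucket."""
    kept: Dict[Tuple[int, ...], datetime] = {}
    for tp in sorted(timepoints):
        kept[bucket_key(tp, now)] = tp  # a later value overwrites an earlier one
    return sorted(kept.values())


now = datetime(2016, 9, 30, 12)
timepoints = [
    datetime(2014, 3, 1), datetime(2014, 11, 1),             # same year bucket
    datetime(2016, 7, 1), datetime(2016, 7, 20),             # same month bucket
    datetime(2016, 9, 28, 1), datetime(2016, 9, 28, 23),     # same day bucket
    datetime(2016, 9, 30, 10), datetime(2016, 9, 30, 11),    # separate hour buckets
]
thinned = thin_timepoints(timepoints, now)
print(thinned)
```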

  • If you are mostly working with the latest versions of the records, store them in a separate index and keep the older versions in another one. In this case you don't even need the "current" property: just run through all the records in your current index and add the timestamp.

I can provide all the queries you need for the steps above if needed.



Answer 2:

You should look at solving this client side. I can think of two ways to approach it.

  1. Use the scroll API to go through all the documents and find the latest version of each. Then, again client side, aggregate by text.
  2. Use an Elasticsearch terms aggregation on doc_id with a max sub-aggregation on version. This gives you the latest version for each document id. Then build a boolean OR filter out of the (doc_id, version) pairs from the first step, and run a terms aggregation on text over the documents that match this filter.
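
A minimal sketch of the second option, with the two request bodies built as Python dicts. The helper names, the aggregation names, and the bucket limit of 10000 are assumptions for illustration, and the sample response is hand-written in the shape a terms aggregation with a max sub-aggregation returns. The time-range filter from the question would have to be added to both queries.

```python
from typing import Dict, List, Tuple


def latest_versions_query() -> Dict:
    """Step 1: one bucket per doc_id, each carrying its maximum version."""
    return {
        "size": 0,
        "aggs": {
            "doc_id_groups": {
                "terms": {"field": "doc_id", "size": 10000},  # assumed bucket limit
                "aggs": {"latest": {"max": {"field": "version"}}},
            }
        },
    }


def extract_latest(response: Dict) -> List[Tuple[int, int]]:
    """Pull (doc_id, latest version) pairs out of the step 1 response."""
    buckets = response["aggregations"]["doc_id_groups"]["buckets"]
    return [(b["key"], int(b["latest"]["value"])) for b in buckets]


def aggregate_latest_query(pairs: List[Tuple[int, int]]) -> Dict:
    """Step 2: OR together the (doc_id, version) pairs and bucket the matches by text."""
    should = [
        {"bool": {"must": [{"term": {"doc_id": d}}, {"term": {"version": v}}]}}
        for d, v in pairs
    ]
    return {
        "size": 0,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        "aggs": {"by_text": {"terms": {"field": "text"}}},
    }


# Hand-written sample response for the documents in the question.
sample_response = {
    "aggregations": {
        "doc_id_groups": {
            "buckets": [
                {"key": 13, "doc_count": 3, "latest": {"value": 3.0}},
                {"key": 14, "doc_count": 2, "latest": {"value": 2.0}},
            ]
        }
    }
}
pairs = extract_latest(sample_response)
second_body = aggregate_latest_query(pairs)
print(pairs)
```

With many document ids the second query grows by one clause per id, so the pairs would probably need to be sent in batches and the bucket counts merged client side.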

Either way, you need to do some client side work. I don't believe scripting will help. If you already know the latest version number for each document, then this is a lot easier.
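
For the first option, the client-side part might look like the sketch below. It assumes the documents have already been fetched (for example by scrolling) and are available as an iterable of dicts. The function name `aggregate_latest_by_text` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_latest_by_text(docs: Iterable[Dict]) -> Counter:
    """Keep the highest version per doc_id, then count the latest versions by text."""
    latest: Dict[int, Dict] = {}
    for doc in docs:
        seen = latest.get(doc["doc_id"])
        if seen is None or doc["version"] > seen["version"]:
            latest[doc["doc_id"]] = doc
    return Counter(doc["text"] for doc in latest.values())


# The sample documents from the question.
docs = [
    {"doc_id": 13, "version": 1, "text": "bar"},
    {"doc_id": 13, "version": 2, "text": "bar"},
    {"doc_id": 13, "version": 3, "text": "bar"},
    {"doc_id": 14, "version": 1, "text": "foo"},
    {"doc_id": 14, "version": 2, "text": "bar"},
]
counts = aggregate_latest_by_text(docs)
print(counts)
```

Only one document per doc_id is held in memory at a time, so the cost on the client grows with the number of distinct documents rather than with the number of versions.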