My index contains many documents, each of which has several versions. For example:
{"doc_id": 13, "version": 1, "text": "bar"}
{"doc_id": 13, "version": 2, "text": "bar"}
{"doc_id": 13, "version": 3, "text": "bar"}
{"doc_id": 14, "version": 1, "text": "foo"}
{"doc_id": 14, "version": 2, "text": "bar"}
I want to get the latest version of each document and then run a terms aggregation over those latest versions only.
I've tried using top_hits to retrieve the latest versions:
{
  "size" : 0,
  "aggs" : {
    "doc_id_groups" : {
      "terms" : {
        "field" : "doc_id",
        "size" : "0"
      },
      "aggs" : {
        "docs" : {
          "top_hits" : {
            "size" : 1,
            "sort" : {
              "version" : {
                "order" : "desc"
              }
            }
          }
        }
      }
    }
  }
}
But I can't aggregate the results, because top_hits doesn't support sub-aggregations.
I guess retrieving the ids and then aggregating them would be a very heavy operation for the client.
Maybe scripting could help?
Update: one thing I forgot to mention: before aggregating, the documents are filtered by a time range, so we don't know which version is the latest at index time, only at search time.
From the provided samples and the additional details in chat, I don't think you can achieve the required result with aggregations alone. I can propose an alternative solution instead:
- Add a Boolean property "current" that is set to true on the latest version of each document. When a new version is inserted, set "current" to false on the older version and to true on the new one.
- Add a multi-valued property "timepoints". At the end of each day (or any other period), append the current timestamp (or any other identifier of the period, e.g. "09.30.2016" or "Jan") to the "timepoints" array of every current record.
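The two bookkeeping steps above can be sketched in plain Python, with an in-memory list standing in for the index. This is only an illustration of the logic, not Elasticsearch client code: the helper names `insert_version` and `stamp_timepoint` are hypothetical, the time point labels are made up, and the sketch assumes versions of a document arrive in increasing order.

```python
from typing import Dict, List


def insert_version(index: List[Dict], new_doc: Dict) -> None:
    """Add new_doc and move the "current" flag to it.

    Assumption: versions of a document are inserted in increasing order.
    """
    for doc in index:
        if doc["doc_id"] == new_doc["doc_id"] and doc.get("current"):
            doc["current"] = False  # the older version is no longer the latest
    index.append({**new_doc, "current": True, "timepoints": []})


def stamp_timepoint(index: List[Dict], label: str) -> None:
    """Periodic job: record `label` on every version that is current right now."""
    for doc in index:
        timepoints = doc.setdefault("timepoints", [])
        if doc.get("current") and label not in timepoints:
            timepoints.append(label)


# Hypothetical timeline built from two of the sample documents.
index: List[Dict] = []
insert_version(index, {"doc_id": 13, "version": 1, "text": "bar"})
insert_version(index, {"doc_id": 14, "version": 1, "text": "foo"})
stamp_timepoint(index, "2016-09-29")
insert_version(index, {"doc_id": 14, "version": 2, "text": "bar"})
stamp_timepoint(index, "2016-09-30")
```

After this run, version 1 of document 14 carries only the first time point, while version 2 carries only the second, so each time point identifies exactly the versions that were current when it was stamped.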
Pros:
You can easily retrieve the records that were current at a given point in time just by checking whether that time point is in the "timepoints" array.
You can retrieve all the available time points from all the documents with a single query.
You can do the aggregation by time points, e.g. to count all the records at every point of time.
No need to maintain multiple indices, duplicates of the records etc., the algorithm is pretty straightforward.
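As a rough illustration of the first three points, the search bodies might look like the following, written here as Python dicts. The aggregation names `texts` and `per_timepoint` are arbitrary, and the sketch assumes "timepoints" and "text" are indexed as exact (not analyzed) values so that term queries and terms aggregations see whole values.

```python
from typing import Dict


def text_counts_at(timepoint: str) -> Dict:
    """Search body: count `text` values among versions current at `timepoint`."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"term": {"timepoints": timepoint}},
        "aggs": {"texts": {"terms": {"field": "text"}}},
    }


def records_per_timepoint() -> Dict:
    """Search body: list every stored time point with its number of records."""
    return {
        "size": 0,
        "aggs": {"per_timepoint": {"terms": {"field": "timepoints"}}},
    }


body = text_counts_at("2016-09-30")
```

Because each stamped time point only ever contains the versions that were current at that moment, a plain terms aggregation is enough and no top_hits step is required.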
Cons:
You cannot get the current versions at an arbitrary point in time, only at the points when the calculation was run.
The overall size of the "timepoints" arrays may increase significantly if you run the calculation too often and you have millions of records.
Workarounds:
For more fine-grained statistics, run the calculation on an hourly basis. But once a day (or month, or year), remove some of the time points for older periods from the "timepoints" array. In the end you will have a set of time points that corresponds to every year (for anything more than a year old), every month (for anything more than a month old), every day (for anything more than a day old), and every hour for the latest period. Of course, the algorithm for removing time points can be refined according to your needs.
If you are mostly working with the latest versions of the records - store them in a separate index, store the older versions in another one. In this case you don't even need the "current" property, just run through all the records in your current index and add the time stamp.
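The thinning of older time points described in the first workaround could be sketched as follows. The function name `thin_timepoints`, the 30-day and 365-day thresholds, and the choice to keep the earliest time point in each bucket are all assumptions made for illustration; any other retention rule would fit the same shape.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def thin_timepoints(timepoints: List[datetime], now: datetime) -> List[datetime]:
    """Keep one time point per bucket, with buckets getting coarser with age."""
    kept: Dict[Tuple, datetime] = {}
    for tp in sorted(timepoints):
        age = now - tp
        if age <= timedelta(days=1):
            bucket = ("hour", tp.year, tp.month, tp.day, tp.hour)
        elif age <= timedelta(days=30):    # assumed meaning of "a month"
            bucket = ("day", tp.year, tp.month, tp.day)
        elif age <= timedelta(days=365):   # assumed meaning of "a year"
            bucket = ("month", tp.year, tp.month)
        else:
            bucket = ("year", tp.year)
        kept.setdefault(bucket, tp)  # the earliest time point in a bucket wins
    return sorted(kept.values())


now = datetime(2016, 9, 30, 12, 0)
thinned = thin_timepoints(
    [
        datetime(2014, 1, 1), datetime(2014, 6, 1),            # same year
        datetime(2016, 3, 1), datetime(2016, 3, 15),           # same month
        datetime(2016, 9, 20, 1), datetime(2016, 9, 20, 5),    # same day
        datetime(2016, 9, 30, 10, 0), datetime(2016, 9, 30, 10, 30),  # same hour
        datetime(2016, 9, 30, 11, 0),
    ],
    now,
)
```

In this example the nine input time points shrink to five: one per year, month, day, and hour bucket.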
I can provide all the queries for the steps above if you need them.
You should look at solving this client side. I can think of two ways to approach it.
- Use the scroll API to go through all the documents and find the latest version of each. Then, again client side, aggregate by text.
- Use an Elasticsearch terms aggregation on doc_id with a max sub-aggregation on version. This gives you the latest version for each document id. Then build a boolean OR filter with one clause per (doc_id, version) pair from the first step, and run a terms aggregation on text over the documents that filter matches.
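A minimal sketch of the second approach, with the request bodies built as Python dicts and no client library involved. The aggregation names `latest` and `texts`, the bucket cap of 10000, and the sample response are assumptions for illustration; the time range filter mentioned in the question would have to be added to both queries.

```python
from typing import Dict, Iterable, List, Tuple


def latest_versions_query() -> Dict:
    """Step 1: the highest version per doc_id."""
    return {
        "size": 0,
        "aggs": {
            "doc_id_groups": {
                # 10000 is an arbitrary cap picked for this sketch
                "terms": {"field": "doc_id", "size": 10000},
                "aggs": {"latest": {"max": {"field": "version"}}},
            }
        },
    }


def extract_latest(response: Dict) -> List[Tuple[int, int]]:
    """Read (doc_id, latest version) pairs out of the step 1 response."""
    buckets = response["aggregations"]["doc_id_groups"]["buckets"]
    # max returns a float, so cast back to an integer version number
    return [(b["key"], int(b["latest"]["value"])) for b in buckets]


def text_aggregation_query(pairs: Iterable[Tuple[int, int]]) -> Dict:
    """Step 2: OR together one clause per pair, then aggregate on text."""
    should = [
        {"bool": {"filter": [{"term": {"doc_id": doc_id}},
                             {"term": {"version": version}}]}}
        for doc_id, version in pairs
    ]
    return {
        "size": 0,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        "aggs": {"texts": {"terms": {"field": "text"}}},
    }


# Hand-written response in the shape step 1 would return for the sample data.
sample_response = {
    "aggregations": {"doc_id_groups": {"buckets": [
        {"key": 13, "doc_count": 3, "latest": {"value": 3.0}},
        {"key": 14, "doc_count": 2, "latest": {"value": 2.0}},
    ]}}
}
pairs = extract_latest(sample_response)
second_query = text_aggregation_query(pairs)
```

The second request grows by one clause per document id, so with many documents it may need to be sent in batches, with the per-batch counts summed on the client.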
Either way, you need to do some client side work. I don't believe scripting will help. If you already know the latest version number for each document, then this is a lot easier.
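For the scroll-based approach, the client-side part is a small reduction over the scrolled hits. The sketch below assumes the hits have already been fetched and unwrapped into plain dicts; `aggregate_latest_texts` is a hypothetical helper name, and the input is the sample data from the question.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_latest_texts(hits: Iterable[Dict]) -> Counter:
    """Keep the highest version per doc_id, then count the text values."""
    latest: Dict[int, Dict] = {}
    for doc in hits:
        seen = latest.get(doc["doc_id"])
        if seen is None or doc["version"] > seen["version"]:
            latest[doc["doc_id"]] = doc
    return Counter(doc["text"] for doc in latest.values())


docs = [
    {"doc_id": 13, "version": 1, "text": "bar"},
    {"doc_id": 13, "version": 2, "text": "bar"},
    {"doc_id": 13, "version": 3, "text": "bar"},
    {"doc_id": 14, "version": 1, "text": "foo"},
    {"doc_id": 14, "version": 2, "text": "bar"},
]
counts = aggregate_latest_texts(docs)  # Counter({'bar': 2})
```

Only one dict entry per document id is held in memory, so the cost on the client is proportional to the number of distinct documents rather than the number of versions.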