I need to update or delete several documents.
When I update I do this:
- I first search for the documents, setting a large limit for the returned results (let’s say, size: 10000).
- For each of the returned documents, I modify certain values.
- I resend the whole modified list to Elasticsearch (bulk index).
This operation repeats until the search in the first step no longer returns results.
When I delete I do this:
- I first search for the documents, setting a large limit for the returned results (let’s say, size: 10000).
- I delete each document found by sending its _id to Elasticsearch, one request per document (10000 requests).
This operation repeats until the search in the first step no longer returns results.
Is this the right way to make an update?
When I delete, is there a way I can send several ids to delete multiple documents at once?
To delete or update documents by id, you can use the bulk API:
Bulk API
The bulk API makes it possible to perform many index/delete operations
in a single API call. This can greatly increase the indexing speed.
The possible actions are index, create, delete and update. index and
create expect a source on the next line, and have the same semantics
as the op_type parameter to the standard index API (i.e. create will
fail if a document with the same index and type exists already,
whereas index will add or replace a document as necessary). delete
does not expect a source on the following line, and has the same
semantics as the standard delete API. update expects that the partial
doc, upsert and script and its options are specified on the next line.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
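As a rough illustration of the update action described in the quote, here is a minimal Python sketch that only builds the newline-delimited body for a _bulk request; it does not send anything. The function name `build_bulk_update_body` and the `status` field are made up for the example, and `indexName` / `typeName` are the same placeholders used elsewhere in this thread.

```python
import json


def build_bulk_update_body(index, doc_type, partial_docs):
    """Build a newline-delimited _bulk body with one update action per document.

    partial_docs maps a document _id to the fields that should change.
    (Hypothetical helper, not part of any Elasticsearch client.)
    """
    lines = []
    for doc_id, fields in partial_docs.items():
        # Action line: identifies the document to update.
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Next line: the partial document to merge into the stored one.
        lines.append(json.dumps({"doc": fields}))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# "status" is an invented field, used only for illustration.
body = build_bulk_update_body(
    "indexName",
    "typeName",
    {"1": {"status": "done"}, "2": {"status": "done"}},
)
print(body)
```

The resulting string would be sent as the body of a single POST to the _bulk endpoint, so only the changed fields travel over the wire instead of the whole re-indexed document.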
You can also delete by query instead:
Delete By Query API
The delete by query API allows to delete documents from one or more
indices and one or more types based on a query. The query can either
be provided using a simple query string as a parameter, or using the
Query DSL defined within the request body.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
For your massive index/update operation, if you are not using it already, take a look at the bulk API documentation. It is tailored for this kind of job.
If you want to retrieve lots of documents in small batches, you should use the scan-scroll search instead of using from/size. Related information can be found here.
To sum up:
- The scroll API is used to load results in memory and to iterate over them efficiently.
- The scan search type disables sorting, which is costly.
Give it a try: depending on the data volume, it could improve the performance of your batch operations.
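To make the scroll idea concrete, here is a small Python sketch of the iteration loop. It assumes two hypothetical callables that wrap whatever HTTP client you use: `first_page()` performs the initial search with a scroll timeout, and `next_page(scroll_id)` performs the follow-up scroll request. Both are assumed to return the parsed JSON response. The hits of the first response are yielded too, so the loop also works when that first response contains only a scroll id and no hits, as happens with the scan search type.

```python
def iter_scroll_hits(first_page, next_page):
    """Yield every hit by following scroll ids until a scroll page comes back empty.

    first_page and next_page are hypothetical callables supplied by the caller;
    each returns a response dict shaped like
    {"_scroll_id": "...", "hits": {"hits": [...]}}.
    """
    response = first_page()
    # May be empty (e.g. with search_type=scan); yielding nothing is fine.
    yield from response["hits"]["hits"]
    while True:
        # Always use the most recent scroll id returned by the server.
        response = next_page(response["_scroll_id"])
        hits = response["hits"]["hits"]
        if not hits:
            break  # an empty scroll page means everything has been read
        yield from hits


# Usage with canned responses standing in for real requests.
fake_pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": [{"_id": "3"}]}},
    "s3": {"_scroll_id": "s4", "hits": {"hits": []}},
}
first = lambda: {"_scroll_id": "s1", "hits": {"hits": []}}
found_ids = [hit["_id"] for hit in iter_scroll_hits(first, fake_pages.__getitem__)]
print(found_ids)
```

Each batch of hits can then be turned into bulk actions, so the search is opened once instead of being re-run until it returns nothing.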
For the delete operation, you can use this same _bulk API to send multiple delete operations at once.
The format of each line is the following:
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "1" } }
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "2" } }
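If it helps, here is a minimal Python sketch that generates lines in that format from a list of ids and splits them into several request bodies. The helper names and the batch size of 1000 are assumptions chosen for the example, not recommended values; the code only builds the bodies and does not send them.

```python
import json


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_bulk_delete_body(index, doc_type, ids):
    """Build a newline-delimited _bulk body with one delete action per id.

    (Hypothetical helper, not part of any Elasticsearch client.)
    """
    lines = [
        json.dumps({"delete": {"_index": index, "_type": doc_type, "_id": doc_id}})
        for doc_id in ids
    ]
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# 2500 made-up ids, sent as 3 bulk requests instead of 2500 single deletes.
ids = [str(n) for n in range(1, 2501)]
bodies = [
    build_bulk_delete_body("indexName", "typeName", batch)
    for batch in chunked(ids, 1000)  # batch size is an arbitrary assumption
]
print(len(bodies))
```

Each element of `bodies` would be posted to the _bulk endpoint as one request.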