I believe there should be a formula to calculate bulk indexing size in Elasticsearch. The following are probably the variables of such a formula:
- Number of nodes
- Number of shards per index
- Document size
- RAM
- Disk write speed
- LAN speed
I wonder if anyone knows of or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
I was searching for this and found your question :) I found this in the Elastic documentation, so I will investigate the size of my documents.
Read the ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
There is no golden rule for this. Extracted from the doc:
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5 MB; it also lets you set a flush interval, but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
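For reference, here is roughly what that looks like. This is just a sketch against the 7.x high-level REST client; builder signatures, package names, and the IndexRequest constructor differ between versions, and "my-index" plus the document source are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue in newer versions
import org.elasticsearch.common.xcontent.XContentType;

import java.util.concurrent.TimeUnit;

public class BulkProcessorExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Callbacks invoked around each bulk request the processor sends.
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("Sending " + request.numberOfActions() + " actions");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                System.out.println("Bulk finished, failures: " + response.hasFailures());
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.err.println("Bulk failed: " + failure.getMessage());
            }
        };

        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
                // These two are the defaults; shown explicitly for clarity.
                .setBulkActions(1000)                               // flush after 1000 actions...
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // ...or after ~5 MB of request data
                // Not set by default; flushes partially filled batches after 10 seconds.
                .setFlushInterval(TimeValue.timeValueSeconds(10))
                .build();

        // Hand individual index requests to the processor; it batches and sends them for you.
        for (int i = 0; i < 10_000; i++) {
            bulkProcessor.add(new IndexRequest("my-index")
                    .source("{\"field\":\"value " + i + "\"}", XContentType.JSON));
        }

        // Flush any remaining partial batch and wait for in-flight requests before shutting down.
        bulkProcessor.awaitClose(30, TimeUnit.SECONDS);
        client.close();
    }
}
```

The processor flushes whenever either threshold is hit, so with the defaults no single request exceeds 1000 actions or roughly 5 MB.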
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware that influence indexing speed: the structure and complexity of your index (complex mappings, filters, or analyzers), data types, whether your workload is I/O- or CPU-bound, and so on.
In any case, to show how variable it can be, I can share my experience, since it seems to differ from most of those posted here:
Elasticsearch 5.6 with a 10 GB heap running on a single vServer with 16 GB RAM, 4 vCPUs, and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the HTTP bulk API (curl) using a batch size of 10k documents (20k lines, file sizes between 25 MB and 79 MB), with each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did; all other settings are the defaults. I guess this is mostly because the index itself is not very complex.
The vServer sits at about 50% CPU, with the SSD averaging 40 MB/s and 4 GB of RAM free, so I could probably make it faster by sending two files in parallel (I tried simply increasing the batch size by 50%, but started getting errors). Beyond that point, it probably makes more sense to consider a different API or simply to spread the load over a cluster.
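The refresh_interval change I mention above is just an index settings update. I did it with curl, but since the Java API came up in another answer, here is roughly the same thing as a sketch with the high-level REST client ("my-index" is a placeholder, and the client setup is the usual boilerplate):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class RefreshIntervalToggle {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Turn off automatic refreshes before the bulk load...
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "-1").build()),
                    RequestOptions.DEFAULT);

            // ...send the bulk requests here...

            // ...then restore the default so newly indexed documents become searchable again.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "1s").build()),
                    RequestOptions.DEFAULT);
        }
    }
}
```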