I believe there should be a formula to calculate bulk indexing size in Elasticsearch. The following are probably the variables of such a formula:
- Number of nodes
- Number of shards per index
- Document size
- RAM
- Disk write speed
- LAN speed
I wonder if anyone knows of or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
I was searching for this and found your question :) I found this in the Elastic documentation, so I will investigate the size of my documents.
Read the ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
There is no golden rule for this. Extracted from the doc:
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5 MB; it also lets you set a flush interval, but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
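For reference, here is roughly what that looks like. This is just a sketch against the 7.x high-level REST client; builder signatures, package names, and the IndexRequest constructor differ between versions, and "my-index" plus the document source are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue in newer versions
import org.elasticsearch.common.xcontent.XContentType;

import java.util.concurrent.TimeUnit;

public class BulkProcessorExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Callbacks invoked around each bulk request the processor sends.
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("Sending " + request.numberOfActions() + " actions");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                System.out.println("Bulk finished, failures: " + response.hasFailures());
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.err.println("Bulk failed: " + failure.getMessage());
            }
        };

        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
                // These two are the defaults; shown explicitly for clarity.
                .setBulkActions(1000)                               // flush after 1000 actions...
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // ...or after ~5 MB of request data
                // Not set by default; flushes partially filled batches after 10 seconds.
                .setFlushInterval(TimeValue.timeValueSeconds(10))
                .build();

        // Hand individual index requests to the processor; it batches and sends them for you.
        for (int i = 0; i < 10_000; i++) {
            bulkProcessor.add(new IndexRequest("my-index")
                    .source("{\"field\":\"value " + i + "\"}", XContentType.JSON));
        }

        // Flush any remaining partial batch and wait for in-flight requests before shutting down.
        bulkProcessor.awaitClose(30, TimeUnit.SECONDS);
        client.close();
    }
}
```

The processor flushes whenever either threshold is hit, so with the defaults no single request exceeds 1000 actions or roughly 5 MB.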
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware that influence indexing speed: the structure and complexity of your index (complex mappings, filters, or analyzers), data types, whether your workload is I/O- or CPU-bound, and so on.
In any case, to show how variable it can be, I can share my experience, since it seems to differ from most of those posted here:
Elasticsearch 5.6 with a 10 GB heap running on a single vServer with 16 GB RAM, 4 vCPUs, and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the HTTP bulk API (curl) using a batch size of 10k documents (20k lines, file sizes between 25 MB and 79 MB), with each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did; all other settings are the defaults. I guess this is mostly because the index itself is not very complex.
The vServer sits at about 50% CPU, with the SSD averaging 40 MB/s and 4 GB of RAM free, so I could probably make it faster by sending two files in parallel (I tried simply increasing the batch size by 50%, but started getting errors). Beyond that point, it probably makes more sense to consider a different API or simply to spread the load over a cluster.
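The refresh_interval change I mention above is just an index settings update. I did it with curl, but since the Java API came up in another answer, here is roughly the same thing as a sketch with the high-level REST client ("my-index" is a placeholder, and the client setup is the usual boilerplate):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class RefreshIntervalToggle {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Turn off automatic refreshes before the bulk load...
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "-1").build()),
                    RequestOptions.DEFAULT);

            // ...send the bulk requests here...

            // ...then restore the default so newly indexed documents become searchable again.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "1s").build()),
                    RequestOptions.DEFAULT);
        }
    }
}
```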