I am trying to figure out why Elasticsearch is so slow at indexing. I am unsure whether it is a limitation of Elasticsearch itself, but I will share what I have so far.
I have a single Elasticsearch node and a Logstash instance running on one box. My documents have about 15 fields, and I have an Elasticsearch mapping set up with the correct types (although I have tried without the mapping and get pretty much identical results).
I am indexing roughly 8 to 10 million events at a time and have taken the following approaches.
The bulk API, with the following format (I converted the CSV to JSON and placed it into a file, which I send in with curl; the request itself is sketched after the example):
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
I have also tried Logstash, both with a TCP input receiving the original CSV and with a file input, where I cat the CSV onto the end of a file Logstash is watching (a minimal config sketch is below).
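The file-input pipeline looks something like this (a minimal sketch assuming Logstash 2.x or newer, where the elasticsearch output takes hosts; the path, column names, and index name are all placeholders):

    input {
      file {
        path => "/data/events.csv"
        start_position => "beginning"
      }
    }
    filter {
      # Split each CSV line into named fields
      csv {
        columns => ["field1", "field2"]
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "myindex"
      }
    }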
All three of these methods seem to top out around 10,000 events per second, which is very slow.
Am I doing something wrong? Should I be explicitly assigning an id in my bulk ingest rather than letting Elasticsearch auto-generate one?
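To be concrete, explicitly assigning an id would mean putting it in the action line of each pair (the id value here is just a placeholder):

    {"create" : {"_id" : "event-0001"}}
    {"field1" : "value1", "field2" : "value2"}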
When ingesting through the bulk API, I have also split the events up into 50,000- and 100,000-event files and ingested each file separately.
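Roughly like this (a sketch; split counts are in lines, and each event is two lines, so 100,000 lines is 50,000 events):

    # Split the bulk file into 50,000-event chunks (2 lines per event)
    split -l 100000 events.json chunk_
    # POST each chunk to the bulk endpoint
    for f in chunk_*; do
      curl -s -H "Content-Type: application/x-ndjson" \
        -XPOST "localhost:9200/myindex/_bulk" --data-binary "@$f"
    done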
I recommend this blog. Adjusting the following parameters should help during bulk indexing, but once you are done, restore refresh_interval to its normal value (e.g. 1s).
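As a sketch (the index name and the restored values are placeholders; adjust to your own defaults), the usual pattern is to disable refresh and drop replicas for the duration of the load, then restore them afterwards:

    # Before the bulk load: disable refresh and drop replicas
    curl -XPUT "localhost:9200/myindex/_settings" \
      -H "Content-Type: application/json" \
      -d '{"index" : {"refresh_interval" : "-1", "number_of_replicas" : 0}}'

    # After the bulk load: restore the defaults
    curl -XPUT "localhost:9200/myindex/_settings" \
      -H "Content-Type: application/json" \
      -d '{"index" : {"refresh_interval" : "1s", "number_of_replicas" : 1}}'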
You'll find I've done some research on this here; you can download the Indexing Scripts file, which has some useful scripts to maximise indexing performance. It really does vary with the hardware and with how Elasticsearch is tuned for indexing, e.g. removing replicas during the load.
Hope this helps you somewhat.