I am trying to figure out why Elasticsearch is so slow at indexing. I am not sure whether it is a limitation of Elasticsearch itself, but I will share what I have so far.
I have a single Elasticsearch node and a Logstash instance running on one box. My documents have about 15 fields, and I have an Elasticsearch mapping set up with the correct types (although I have also tried without the mapping and get pretty much identical results).
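The mapping is nothing exotic, just explicit types for each field. On Elasticsearch 7+ the index creation looks roughly like this (the field names and index name here are placeholders, not my real schema):

PUT /events
{
  "mappings": {
    "properties": {
      "field1": { "type": "keyword" },
      "field2": { "type": "date" },
      "field3": { "type": "long" }
    }
  }
}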
I am indexing roughly 8-10 million events at a time and have taken the following approaches:
The bulk API with the following format (I converted the CSV to JSON and placed it into a file, which I then curl in):
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
{"create" : {}}
{"field1" : "value1", "field2" : "value2 .... }
I have also tried Logstash, both with a TCP input fed the original CSV and with a file input where I cat the CSV onto the end of a file Logstash is watching (a simplified version of the file-input pipeline is below).
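The file-listener variant of the pipeline is essentially this, with paths, separator, and column names simplified:

input {
  file {
    path => "/tmp/events.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    columns => ["field1", "field2", "field3"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "events"
  }
}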
All three of these methods seem to ingest around 10,000 events per second, which feels very slow; at that rate a full 8-10 million event load takes roughly 15-17 minutes.
Am I doing something wrong? Should I be explicitly assigning an id in my bulk ingest rather than letting it auto-generate one?
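In other words, would it help to switch the action lines to something like this instead of the empty create object:

{"create" : {"_id" : "1"}}
{"field1" : "value1", "field2" : "value2", ... }
{"create" : {"_id" : "2"}}
{"field1" : "value1", "field2" : "value2", ... }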
When ingesting through the bulk API I have split the events up into 50,000- and 100,000-event files and ingested each separately.
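The splitting itself is just the standard split utility plus a loop over the chunks, roughly like this (file and index names are placeholders):

# each event is two lines (action + source), so 200000 lines = 100000 events
split -l 200000 bulk.json chunk_
for f in chunk_*; do
  curl -s -H "Content-Type: application/x-ndjson" \
       -XPOST "localhost:9200/events/_bulk" \
       --data-binary "@$f" > /dev/null
done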