I have to store some messages in Elasticsearch, integrated with my Python program. What I currently do to store a message is:
d={"message":"this is message"}
for index_nr in range(1,5):
ElasticSearchAPI.addToIndex(index_nr, d)
print d
That means if I have 10 messages, I have to repeat my code 10 times. So what I want to do instead is prepare a script file or batch file. I checked the Elasticsearch Guide, and the Bulk API can be used for this: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html The format should be something like below:
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }
What I did is:
{"index":{"_index":"test1","_type":"message","_id":"1"}}
{"message":"it is red"}
{"index":{"_index":"test2","_type":"message","_id":"2"}}
{"message":"it is green"}
I also used the curl tool to store the doc:
$ curl -s -XPOST localhost:9200/_bulk --data-binary @message.json
Now I want to use my Python code to store the file to Elasticsearch.
Although @justinachen's code helped me get started with py-elasticsearch, after looking into the source code let me suggest a simple improvement:
helpers.bulk() already does the segmentation for you, and by segmentation I mean the chunks sent to the server each time. If you want to reduce the number of documents sent per chunk, do:
helpers.bulk(es, actions, chunk_size=100)
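For example, the two messages from the question can be sent in one call like this (a sketch that assumes a node reachable on localhost:9200):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a node on localhost:9200

# One action per document: the metadata fields plus the document body.
actions = [
    {"_index": "test1", "_type": "message", "_id": 1, "_source": {"message": "it is red"}},
    {"_index": "test2", "_type": "message", "_id": 2, "_source": {"message": "it is green"}},
]

# helpers.bulk() splits the actions into chunks internally; chunk_size
# only controls how many actions go to the server per request.
helpers.bulk(es, actions, chunk_size=100)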
Some handy info to get started:
helpers.bulk() is just a wrapper of helpers.streaming_bulk, but the first accepts a list, which makes it handy.
helpers.streaming_bulk is based on Elasticsearch.bulk(), so you do not need to worry about what to choose.
So in most cases, helpers.bulk() should be all you need.
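To illustrate the difference, here is a sketch of the streaming variant: helpers.streaming_bulk takes any iterable (e.g. a generator) and yields one (ok, result) tuple per action, so you have to consume it yourself:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a node on localhost:9200

def generate_actions():
    # Yield actions one by one instead of building a list in memory.
    for doc_id, text in enumerate(["it is red", "it is green"], start=1):
        yield {"_index": "test1", "_type": "message", "_id": doc_id,
               "_source": {"message": text}}

# streaming_bulk is itself a generator of (ok, result) tuples,
# so it has to be iterated to actually send anything.
for ok, result in helpers.streaming_bulk(es, generate_actions(), chunk_size=100):
    if not ok:
        print(result)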
(The other approaches mentioned in this thread use a Python list for the ES update, which is not a good solution today, especially when you need to add millions of documents to ES.)
A better approach is to use Python generators -- process gigs of data without running out of memory or compromising much on speed.
Below is an example snippet from a practical use case - adding data from an nginx log file to ES for analysis.
This skeleton demonstrates the usage of generators. You can use it even on a bare machine if you need to, and you can go on expanding on it to tailor it to your needs quickly.
See the Python Elasticsearch client reference for details.
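A cut-down sketch of that skeleton (the log path and the line parsing are placeholders; real field extraction would replace the single "message" field). The generator only produces the documents; the two options below show how to attach the index name and document type and push everything with helpers.bulk():

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a local node

def nginx_log_generator(path):
    # Yield one document per log line; only the current line is held in
    # memory, so the size of the log file does not matter.
    with open(path) as log_file:
        for line in log_file:
            # Real parsing of the nginx log format would go here.
            yield {"message": line.strip()}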
Either define the index name and document type with each entity:
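For example (the index and type names are placeholders):

def nginx_log_actions(path):
    with open(path) as log_file:
        for line in log_file:
            # Each action carries its own metadata next to the document body.
            yield {"_index": "nginx-access",   # placeholder index name
                   "_type": "log",             # placeholder type name
                   "_source": {"message": line.strip()}}

helpers.bulk(es, nginx_log_actions("access.log"))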
Or provide the default index and document type with the method call:
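Here the generator stays metadata-free and the defaults go on the call (again, placeholder names):

# The plain documents yielded by nginx_log_generator() are routed to the
# defaults given here; index and doc_type are passed through to the Bulk API.
helpers.bulk(es, nginx_log_generator("access.log"),
             index="nginx-access", doc_type="log")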
Works with:
ES version: 6.4.0
ES python lib: 6.3.1