How to use Bulk API to store the keywords in ES by

I have to store some message in ElasticSearch integrate with my python program. Now what I try to store the message is:

d={"message":"this is message"}
    for index_nr in range(1,5):
        ElasticSearchAPI.addToIndex(index_nr, d)
        print d

That means if I have 10 message then I have to repeat my code 10 times. So what I want to do is try to make a script file or batch file. I check ElasticSearch Guide, BULK API is possible to use. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html The format should be something like below:

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }

what I did is:

{"index":{"_index":"test1","_type":"message","_id":"1"}}
{"message":"it is red"}
{"index":{"_index":"test2","_type":"message","_id":"2"}}
{"message":"it is green"}

I also use curl tool to store the doc.

$ curl -s -XPOST localhost:9200/_bulk --data-binary @message.json

Now I want to use my Python code to store the file to the Elastic Search.

标签： python elasticsearch elasticsearch-bulk-api

4条回答

欢心

2楼-- · 2019-01-06 10:48

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch()

actions = [
  {
    "_index": "tickets-index",
    "_type": "tickets",
    "_id": j,
    "_source": {
        "any":"data" + str(j),
        "timestamp": datetime.now()}
  }
  for j in range(0, 10)
]

helpers.bulk(es, actions)

0人赞添加讨论(0) 举报

狗以群分

3楼-- · 2019-01-06 10:50

Although @justinachen 's code helped me start with py-elasticseearch, after looking in the source code let me do a simple improvement:

es = Elasticsearch()
j = 0
actions = []
while (j <= 10):
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j,
        "_source": {
            "any":"data" + str(j),
            "timestamp": datetime.now()
            }
        }
    actions.append(action)
    j += 1

helpers.bulk(es, actions)

helpers.bulk() already does the segmentation for you. And by segmentation I mean the chucks sent every time to the server. If you want to reduce the chunk of sent documents do: helpers.bulk(es, actions, chunk_size=100)

Some handy info to get started:

helpers.bulk() is just a wrapper of the helpers.streaming_bulk but the first accepts a list which makes it handy.

helpers.streaming_bulk has been based on Elasticsearch.bulk() so you do not need to worry about what to choose.

So in most cases helpers.bulk() should be all you need.

0人赞添加讨论(0) 举报

再贱就再见

4楼-- · 2019-01-06 10:59

(the other approaches mentioned in this thread use python list for the ES update, which is not a good solution today, especially when you need to add millions of data to ES)

Better approach is using python generators -- process gigs of data without going out of memory or compromising much on speed.

Below is an example snippet from a practical use case - adding data from nginx log file to ES for analysis.

def decode_nginx_log(_nginx_fd):
    for each_line in _nginx_fd:
        # Filter out the below from each log line
        remote_addr = ...
        timestamp   = ...
        ...

        # Index for elasticsearch. Typically timestamp.
        idx = ...

        es_fields_keys = ('remote_addr', 'timestamp', 'url', 'status')
        es_fields_vals = (remote_addr, timestamp, url, status)

        # We return a dict holding values from each line
        es_nginx_d = dict(zip(es_fields_keys, es_fields_vals))

        # Return the row on each iteration
        yield idx, es_nginx_d   # <- Note the usage of 'yield'

def es_add_bulk(nginx_file):
    # The nginx file can be gzip or just text. Open it appropriately.
    ...

    es = Elasticsearch(hosts = [{'host': 'localhost', 'port': 9200}])

    # NOTE the (...) round brackets. This is for a generator.
    k = ({
            "_index": "nginx",
            "_type" : "logs",
            "_id"   : idx,
            "_source": es_nginx_d,
         } for idx, es_nginx_d in decode_nginx_log(_nginx_fd))

    helpers.bulk(es, k)

# Now, just run it.
es_add_bulk('./nginx.1.log.gz')

This skeleton demonstrates the usage of generators. You can use this even on a bare machine if you need to. And you can go on expanding on this to tailor to your needs quickly.

Python Elasticsearch reference here.

0人赞添加讨论(0) 举报

smile是对你的礼貌

5楼-- · 2019-01-06 11:11

Define index name and document type with each entity:

es_client = Elasticsearch()

body = []
for entry in entries:
    body.append({'index': {'_index': index, '_type': 'doc', '_id': entry['id']}})
    body.append(entry)

response = es_client.bulk(body=body)

Provide the default index and document type with the method:

es_client = Elasticsearch()

body = []
for entry in entries:
    body.append({'index': {'_id': entry['id']}})
    body.append(entry)

response = es_client.bulk(index='my_index', doc_type='doc', body=body)

Works with:

ES version:6.4.0

ES python lib: 6.3.1

0人赞添加讨论(0) 举报

How to use Bulk API to store the keywords in ES by

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间