ElasticSearch setup for a large cluster with heavy aggregations


Context and current state

We are migrating our cluster from Cassandra to a full ElasticSearch cluster. We are indexing documents at an average of ~250-300 docs per second. With ElasticSearch 1.2.0 this represents about 8 GB per day. Here is an example of the documents we index:

{
 "generic":
    {
      "id": "twi471943355505459200",
      "type": "twitter",
      "title": "RT @YukBerhijabb: The Life is Choice - https://m.facebook.com/story.php?story_fbid=637864496306297&id=100002482564531&refid=17",
      "content": "RT @YukBerhijabb: The Life is Choice - https://m.facebook.com/story.php?story_fbid=637864496306297&id=100002482564531&refid=17",
      "source": "<a href=\"https://twitter.com/download/android\" rel=\"nofollow\">Twitter for  Android</a>",
      "geo": null,
      "link": "http://twitter.com/rosi_sifah/status/471943355505459200",
      "lang": "en",
      "created_at": 1401355038000,
      "author": {
        "username": "rosi_sifah",
        "name": "Rosifah",
        "id": 537798506,
        "avatar": "http://pbs.twimg.com/profile_images/458917673456238592/Im22zoIV_normal.jpeg",
        "link": "http://twitter.com/rosi_sifah"
      }
    },
 "twitter": {
   // a tweet JSON
 }
}

Our users save their queries in our SQL database. When they open their dashboard, we would like to run the stored query against our ES cluster and do some aggregation on top of it using the new ES aggregation framework.

Each dashboard is displayed for an explicit, user-selected date range, so we always use

"range": {
 "generic.created_at": {
   "from": 1401000000000,
   "to": 1401029019706
  }
}

along with the ES query.

We specified _routing that way:

"_routing":{
 "required":true,
 "path":"generic.id"
},

and the _id with:

"_id": {
  "index": "not_analyzed",
  "store": "false",
  "path": "generic.id"
}

For approximately 5 days we've stored 67 million documents (about 40 GB) inside one index. We've learned about the good practice of splitting the index by day, so now our indices are split by day ([index-name]-[YYYY-MM-DD]).

Currently each index has 5 shards and 1 replica. We have a cluster composed of 3 machines, each with 8 cores, 16 GB of RAM and 8 TB of HDD. We plan to use another machine as a gateway (8 cores, 16 GB of RAM, 1 TB of HDD).
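To avoid repeating the mapping and shard settings for every daily index, an index template could apply them automatically to any index matching the daily naming pattern. A minimal sketch, assuming indices named idx-YYYY-MM-DD and a type called stream as in the queries below; the template name is illustrative:

PUT /_template/daily-stream
{
  "template": "idx-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "stream": {
      "_routing": { "required": true, "path": "generic.id" },
      "_id": { "index": "not_analyzed", "store": "false", "path": "generic.id" },
      "properties": {
        "generic": {
          "properties": {
            "created_at": { "type": "date" }
          }
        }
      }
    }
  }
}

Every index created as idx-2014-05-24, idx-2014-05-25, etc. would then pick up the same _routing, _id and shard settings.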

We've left the ES configuration at its defaults, apart from the cluster settings.

Questions

  1. For each document we want to index, we explicitly specify which index to use. Currently we use the current day's date. Should we use the date of the document instead, in order to prevent hot spots? Because currently it means that documents from various days (as specified in their created_at) can end up in the same index for the current day.
  2. Are 5 shards enough (or too many) for 21,600,000 documents per day?
  3. If we want all our aggregation queries to be processed in less than 1 second, how many replicas should we set up?
  4. Should we change our routing? We don't know ahead of time which documents will be processed by the aggregation for each request we make to the cluster (since the query is user-defined).
  5. What kind of hardware (how many machines, what configuration) should we put in this cluster to support 6 months of documents?

[Update]

Here are some example queries:

A word cloud

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "(generic.lang:fr OR generic.lang:en) AND (generic.content:javascript)"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "words":{
      "terms":{
        "field": "generic.content",
        "size": 40
      }
    }
  }
}

A histogram

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "generic.content:apple"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "volume":{
      "date_histogram":{
        "field": "generic.created_at",
        "interval":"minute"
      }
    }
  }
}

Most used language

GET idx-2014-05-01/stream/_search?search_type=count
{
 "query":{
   "bool": {
     "must": [{
       "query_string" : {
         "query" : "(generic.lang:fr OR generic.lang:en) AND (generic.content:javascript)"
        }},{
        "range": {
          "generic.created_at": {
            "from": 1401000000000,
            "to": 1401029019706
          }
        }}
     ]
   }
 },
  "aggs":{
    "top_source":{
      "terms":{
        "field": "generic.lang"
      }
    }
  }
}

1 Answer (by 欢心)

Let me preface all of my answers/comments with the advice to try to, as much as possible, test these scenarios yourself. While Elasticsearch is very scalable, there are many tradeoffs that are heavily impacted by document size and type, ingestion and query volume, hardware and OS. While there are many wrong answers there is rarely one right answer.

I'm basing this response on a couple of active clusters we have, with (currently) about half a million active documents in them, plus some recent benchmarking we performed at about 4x your volume (around 80M documents ingested per day during the benchmark).

1) First off, you are not creating much of a hot spot with 3 nodes when you have even a single index with 5 shards and 1 replica per shard. Elasticsearch will separate each replica from its primary onto a different node, and in general will try to balance out the load of shards. By default Elasticsearch hashes on the ID to pick the shard to index into (which then gets copied to the replica). Even with routing, you will only have a hot-spot issue if you have single IDs that create large numbers of documents per day (which is the span of your index). Even then, it would not be a problem unless these IDs produce a significant percentage of the overall volume AND there are so few of them that you could get clumping on just 1 or 2 of the shards.

Only you can determine that based on your usage patterns - I'd suggest both some analysis of your existing data set to look for overly large concentrations and an analysis of your likely queries.

A bigger question I have is the nature of your queries. You aren't showing the full query nor the full schema (I see "generic.id" referenced but not in the document schema, and your query shows pulling up every single document within a time range - is that correct?). Custom routing for indexing is most useful when your queries are bound by an exact match on the field used for routing. So, if I had an index with everyone's documents in it and my query pattern was to only retrieve a single user's document in a single query, then custom routing by user id would be very useful to improve query performance and reduce overall cluster load.
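For illustration only: if a query really were bound to a single generic.id, passing that same value as the routing parameter would restrict the search to the one shard holding those documents (the id here is simply the one from your sample document):

GET idx-2014-05-01/stream/_search?routing=twi471943355505459200
{
  "query": {
    "term": { "generic.id": "twi471943355505459200" }
  }
}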

A second thing to consider is the overall balance of ingestion vs. queries. You are ingesting over 20M documents a day - how many queries are you executing per day? If that number is <<< the ingestion rate you may want to think through the need for a custom route. Also, if query performance is good or great you may not want to add the additional complexity.

Finally on indexing by ingestion date vs. created_at. We've struggled with that one too as we have some lag in receiving new documents. For now we've gone with storing by ingestion date as it's easier to manage and not a big issue to query multiple indexes at a time, particularly if you auto-create aliases for 1 week, 2 weeks, 1 month, 2 months etc. A bigger issue is what the distribution is - if you have documents that come in weeks or months later perhaps you want to change to indexing by created_at but that will require keeping those indexes online and open for quite some time.
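Querying several daily indices in one request is straightforward, either by listing them or with a wildcard, which is what makes the ingestion-date approach workable (index names follow the idx-YYYY-MM-DD pattern from your examples):

GET idx-2014-05-01,idx-2014-05-02,idx-2014-05-03/stream/_search?search_type=count
{ "query": { "match_all": {} } }

GET idx-2014-05-*/stream/_search?search_type=count
{ "query": { "match_all": {} } }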

We currently use several document indexes per day, basically a "--" format. Practically this currently means 5 indexes per day. This allows us to be more selective about moving data in and out of the cluster. Not a recommendation for you, just something that we've learned is useful to us.

2) Here's the great thing about ES - by creating a new index each day you can adjust as time goes on and increase the number of shards per index. While you cannot change it for an existing index, you are creating a new one every day and you can base your decision on real production analytics. You certainly want to watch the number and be prepared to increase the number of shards as/if your ingestion per day increases. It's not the simplest tradeoff - each one of those shards is a Lucene instance which potentially has multiple files. More shards per index is not free, as that multiplies with time. Given your use case of 6 months, that's over 1800 shards open across 3 nodes (182 days x 5 primaries and 5 replicas per day). There are likely multiple files open per shard. We've found some level of overhead and impact on resource usage on our nodes as total shard count increased in the cluster into these ranges. Your mileage may vary but I'd be careful about increasing the number of shards per index when you are planning on keeping 182 indexes (6 months) at a time - that's quite a multiplier. I would definitely benchmark that ahead of time if you do make any changes to the default shard count.
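If you do end up with hundreds of open indexes, the _cat APIs (available since 1.0) are a quick way to keep an eye on how many indexes and shards are actually open and how big they are:

GET _cat/indices?v

GET _cat/shards?v

The first returns one line per index with document counts and size; the second one line per shard with its node and state.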

3) There's not any way anyone else can predict query performance ahead of time for you. It's based on overall cluster load, query complexity, query frequency, hardware, etc. It's very specific to your environment. You are going to have to benchmark this. Personally given that you've already loaded data I'd use the ES snapshot and restore to bring this data up in a test environment. Try it with the default of 1 replica and see how it goes. Adding replica shards is great for data redundancy and can help spread out queries across the cluster but it comes at a rather steep price - 50% increase in storage plus each additional replica shard will bring additional ingestion cost to the node it runs on. It's great if you need the redundancy and can spare the capacity, not so great if you lack sufficient query volume to really take advantage of it.
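A minimal snapshot/restore sketch using a shared-filesystem repository; the repository name, path and snapshot name are placeholders, and the location has to be reachable from every node:

PUT _snapshot/bench_repo
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups/bench_repo" }
}

PUT _snapshot/bench_repo/snapshot_1?wait_for_completion=true
{ "indices": "idx-*" }

POST _snapshot/bench_repo/snapshot_1/_restore

Register the same repository on the test cluster and run the _restore call there.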

4) Your question is incomplete (it ends with "we never") so I can't answer it directly - however a bigger question is why are you custom routing to begin with? Sure it can have great performance benefits but it's only useful if you can segment off a set of documents by the field you use to route. It's not entirely clear from your example data and partial query if that's the case. Personally I'd test it without any custom routing and then try the same with it and see if it has a significant impact.

5) Another question that will require some work on your part. You need to be tracking (at a minimum) JVM heap usage, overall memory and cpu usage, disk usage and disk io activity over time. We do, with thresholds set to alert well ahead of seeing issues so that we can add new members to the cluster early. Keep in mind that when you add a node to a cluster ES is going to try to re-balance the cluster. Running production with only 3 nodes with a large initial document set can cause issues if you lose a node to problems (Heap exhaustion, JVM error, hardware failure, network failure, etc). ES is going to go Yellow and stay there for quite some time while it reshuffles.
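Most of those numbers (heap, memory, CPU, disk, cluster state) are exposed by the standard health and stats endpoints, so they are easy to poll from whatever monitoring you already run:

GET _cluster/health

GET _nodes/stats/jvm,os,fs

The first reports cluster status and the relocating/unassigned shard counts; the second gives per-node JVM heap, OS memory/CPU and filesystem usage.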

Personally for large document numbers and high ingestion I'd start adding nodes earlier. With more nodes in place it's less of an issue if you take a node out for maintenance. Concerning your existing configuration, how did you get to 8 TB of HDD per node? Given an ingestion of 8GB a day that seems like overkill for 6 months of data. I'd strongly suspect given the volume of data and number of indexes/shards you will want to move to more nodes which will even further reduce your storage per node requirement.

I'd definitely want to benchmark a maximum amount of documents per node by looping through high-volume ingestion and loops of normal query frequency on a cluster with just 1 or 2 nodes and see where it fails (either in performance, heap exhaustion or other issues). I'd then plan to keep the number of documents per node well below that number.
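For the ingestion half of such a benchmark, the _bulk API is the usual way to push documents in at volume. A sketch with made-up documents; since your _id and _routing mappings both point at generic.id, neither needs to be set explicitly in the action lines:

POST /_bulk
{ "index": { "_index": "idx-2014-05-01", "_type": "stream" } }
{ "generic": { "id": "twi000000000000000001", "type": "twitter", "content": "first test document", "lang": "en", "created_at": 1401000000000 } }
{ "index": { "_index": "idx-2014-05-01", "_type": "stream" } }
{ "generic": { "id": "twi000000000000000002", "type": "twitter", "content": "second test document", "lang": "en", "created_at": 1401000001000 } }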

All that said I'd go out on a limb and say I doubt you'll be all that happy with 4 billion plus documents on 3 16GB nodes. Even if it worked (again, test, test, test) losing one node is going to be a really big event. Personally I like the smaller nodes but prefer lots of them.

Other thoughts - we initially benchmarked on 3 Amazon EC2 m1.xlarge instances (4 cores, 15 GB of memory) which worked fine over several days of ingestion at 80M documents a day, with a larger average document size than you appear to have. The biggest issue was the number of indexes and shards open (we were creating a couple of hundred new indexes per day with maybe a couple thousand more shards per day, and this was creating issues). We've since added a couple of new nodes that have 30 GB of memory and 8 cores and then added another 80M documents to test it out. Our current production approach is to prefer more moderately sized nodes as opposed to a few large ones.

UPDATE:

Regarding the benchmarking hardware, it was, as stated above, benchmarked on 3 Amazon EC2 m1.xlarge virtual instances running Ubuntu 12.04 LTS and ES 1.1.0. We ran at about 80M documents a day (pulling data out of a MongoDB database we had previously used). Each instance had 1 TB of storage via Amazon EBS, with provisioned IOPS of, I believe, 1000. We ran for about 4-5 days. We appear to have been a bit CPU-constrained at 80M a day and believe that more nodes would have increased our ingestion rate. As the benchmark ran and the number of indexes and shards increased we saw increasing memory pressure. We created a large number of indexes and shards (roughly 4-5 indexes per 1M documents, or about 400 indexes per day, with 5 primary shards and 1 replica shard per index).

Regarding the index aliases, we're creating, via a cron entry, rolling index aliases for 1 week back, 2 weeks back, etc., so that our application can just hit a known index alias and always run against a set time frame back from today. We're using the index aliases REST API to create and delete them:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
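For reference, that kind of rolling alias update is a single atomic _aliases call; the alias and index names here are just illustrative:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "idx-2014-05-24", "alias": "last-1-week" } },
    { "add":    { "index": "idx-2014-05-31", "alias": "last-1-week" } }
  ]
}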
