I am seeing bulk indexing performance with the .NET NEST client and ElasticSearch degrade over time, even though the number of indexes and documents stays constant.
We are running ElasticSearch version 0.19.11 (JVM: 23.5-b02) on an m1.large Amazon instance with Ubuntu Server 12.04.1 LTS 64-bit and Sun Java 7. Nothing else runs on this instance beyond what ships with the Ubuntu install.
Amazon M1 Large Instance: from http://aws.amazon.com/ec2/instance-types/
7.5 GiB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: 500 Mbps
API name: m1.large
ES_MAX_MEM is set to 4g and ES_MIN_MEM is set to 2g
Every night we index/re-index ~15,000 documents using NEST from our .NET application. At any given time there is only one index, with <= 15,000 documents.
When the server was first installed, indexing and search were fast for the first couple of days; then indexing started to get slower and slower. The bulk indexer sends 100 documents at a time, and after a while a single bulk operation would take up to 15 seconds to finish. After that we started to see a lot of the following exception, and indexing ground to a halt.
System.Net.WebException: The request was aborted: The request was canceled.
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
The bulk indexing implementation looks like this:
private ElasticClient GetElasticClient()
{
    var setting = new ConnectionSettings(ConfigurationManager.AppSettings["elasticSearchHost"], 9200);
    setting.SetDefaultIndex("products");
    var elastic = new ElasticClient(setting);
    return elastic;
}

private void DisableRefreshInterval()
{
    var elasticClient = GetElasticClient();
    var s = elasticClient.GetIndexSettings("products");
    var settings = s != null && s.Settings != null ? s.Settings : new IndexSettings();
    settings["refresh_interval"] = "-1";
    var result = elasticClient.UpdateSettings(settings);
    if (!result.OK)
        _logger.Warn("unable to set refresh_interval to -1, {0}", result.ConnectionStatus == null || result.ConnectionStatus.Error == null ? "" : result.ConnectionStatus.Error.ExceptionMessage);
}

private void EnableRefreshInterval()
{
    var elasticClient = GetElasticClient();
    var s = elasticClient.GetIndexSettings("products");
    var settings = s != null && s.Settings != null ? s.Settings : new IndexSettings();
    settings["refresh_interval"] = "1s";
    var result = elasticClient.UpdateSettings(settings);
    if (!result.OK)
        _logger.Warn("unable to set refresh_interval to 1s, {0}", result.ConnectionStatus == null || result.ConnectionStatus.Error == null ? "" : result.ConnectionStatus.Error.ExceptionMessage);
}

public void Index(IEnumerable<Product> products)
{
    var enumerable = products as Product[] ?? products.ToArray();
    var elasticClient = GetElasticClient();
    try
    {
        DisableRefreshInterval();
        _logger.Info("Indexing {0} products", enumerable.Count());
        var status = elasticClient.IndexMany(enumerable, "products");
        if (status.Items != null)
            _logger.Info("Done, Indexing {0} products, duration: {1}", status.Items.Count(), status.Took);
        if (status.ConnectionStatus.Error != null)
        {
            _logger.Error(status.ConnectionStatus.Error.OriginalException);
        }
    }
    catch (Exception ex)
    {
        _logger.Error(ex);
    }
    finally
    {
        EnableRefreshInterval();
    }
}
Restarting the elasticsearch daemon does not seem to make any difference whatsoever, but deleting the index and re-indexing everything does. After a few days, though, the same slow-indexing problem comes back.
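One way to check whether the index itself is degrading (rather than the client) is to watch its stats as indexing slows down. These are illustrative curl calls against the standard REST API on the default port; adjust host and index names to your setup:

```shell
# Document counts (including deleted docs) and store size for the index.
# A steadily growing deleted-doc count or store size on a "constant" index
# is a hint that documents accumulate instead of being replaced.
curl -s 'http://localhost:9200/products/_stats?pretty'

# Per-segment breakdown, useful for spotting an unusually large segment count.
curl -s 'http://localhost:9200/products/_segments?pretty'
```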
I have just deleted the index and added an Optimize call after re-enabling the refresh interval at the end of each bulk-index operation, in the hope that this will keep the index from degrading:
...
...
finally
{
    EnableRefreshInterval();
    elasticClient.Optimize("products");
}
Am I doing something horribly wrong here?
Sorry - just started writing another quite long comment and thought I'd just stick it all in an answer in case it benefits someone else...
ES_HEAP_SIZE
The first thing I noticed here is that you said you set different max and min heap values for elasticsearch. These should be the same. In the configuration / init.d script there is an ES_HEAP_SIZE variable you can set. Be sure to set only this (and not the min and max values): it sets min and max to the same value, which is what you want. If they differ, the JVM can block the java process when it needs to grow the heap; see this great write-up of a very recent outage at GitHub.
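Assuming the standard Debian/Ubuntu packaging (the exact file may differ on your install), the change is a one-liner:

```shell
# /etc/default/elasticsearch (or the init.d/env script your package uses):
# remove ES_MIN_MEM / ES_MAX_MEM and set a single value instead
ES_HEAP_SIZE=4g
```

This pins the minimum and maximum heap to the same size, so the JVM never stalls resizing the heap mid-run.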
Also check out this great post for more elasticsearch config from the trenches.
Lock Memory to Stop Swapping
From my research I've found that you should also lock the memory used by the Java process so it cannot be swapped out. I'm no expert in this field, but what I've been told is that swapping will also kill performance. You can set bootstrap.mlockall in your elasticsearch.yml config file.
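The relevant line in elasticsearch.yml is just the following (the process user may also need `ulimit -l unlimited`, otherwise elasticsearch logs a warning at startup that mlockall failed):

```yaml
# elasticsearch.yml: lock the heap in RAM so the OS cannot swap it out
bootstrap.mlockall: true
```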
Upgrades
Elasticsearch is still quite new. Plan to upgrade fairly frequently, as the bug fixes introduced between the version you are on (0.19.11) and the current version (0.20.4) are very significant. See the ES site for details. You're on Java 7, which is definitely the right way to go; I started on Java 6 and quickly realized it just wasn't good enough, especially for bulk inserting.
Plugins
Finally, to anyone else who experiences similar issues: get a decent plugin installed for an overview of your nodes and the JVM. I recommend bigdesk. Run bigdesk, then hit elasticsearch with some bulk inserts and watch for strange heap-memory patterns, a very large number of threads, and so on. It's all there!
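If memory serves, installing a site plugin in the 0.19/0.20 era was a single command (check the bigdesk README for the exact plugin name in case it has moved); the UI is then served by elasticsearch itself:

```shell
# from the elasticsearch home directory
./bin/plugin -install lukas-vlcek/bigdesk
# then open http://localhost:9200/_plugin/bigdesk/ in a browser
```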
Hope someone finds this useful!
Cheers, James
Just to venture a guess:
As index performance degrades, do you notice the index takes up more space on disk?
It could be that, rather than replacing the old index or the old documents when reindexing, you are instead adding a bunch of new documents, effectively doubling the document count with largely duplicated data. It might be worth grabbing an aged, slow index and loading it into a viewer (Luke, for instance) to debug it. If you see far more documents than you expect, consider having your rebuild create a new index to replace the old one instead.
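A cheap first check of this guess, before reaching for Luke, is to compare the live document count against the expected ~15,000 after a nightly run. Illustrative curl call, assuming the default port:

```shell
# should stay around 15000 if re-indexing really replaces documents
curl -s 'http://localhost:9200/products/_count?pretty'
```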
Since restarting the daemon doesn't fix the problem, I would suppose that leaked file handles, runaway processes, lingering connections, and the like can be ruled out, though I would still check those statistics and look for any suspect behavior on the server.
Also, regarding Optimize: you may well see some performance enhancements from it, but it is a very expensive operation. I would recommend running an optimize only after the full rebuild is completed, rather than after each incremental bulk-index operation.
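For reference, the same full optimize via the REST API, merging the index down to a single segment once the nightly rebuild has finished (this is the expensive part, so schedule it off-peak):

```shell
curl -s -XPOST 'http://localhost:9200/products/_optimize?max_num_segments=1'
```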