I have an application that users a lot of batch reads in Aerospike. After some time the application is running, I get a lot these errors:
play.api.Application$$anon$1: Execution exception[[AerospikeException: Error Code 9: Timeout]]
at play.api.Application$class.handleError(Application.scala:296) ~[com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at play.api.DefaultApplication.handleError(Application.scala:402) [com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3$$anonfun$applyOrElse$4.apply(PlayDefaultUpstreamHandler.scala:320) [com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3$$anonfun$applyOrElse$4.apply(PlayDefaultUpstreamHandler.scala:320) [com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at scala.Option.map(Option.scala:146) [org.scala-lang.scala-library-2.11.7.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3.applyOrElse(PlayDefaultUpstreamHandler.scala:320) [com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3.applyOrElse(PlayDefaultUpstreamHandler.scala:316) [com.typesafe.play.play_2.11-2.3.7.jar:2.3.7]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:344) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:343) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.11.7.jar:na]
at play.api.libs.iteratee.Execution$trampoline$.execute(Execution.scala:46) [com.typesafe.play.play-iteratees_2.11-2.3.7.jar:2.3.7]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1361) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [org.scala-lang.scala-library-2.11.7.jar:na]
Caused by: com.aerospike.client.AerospikeException: Error Code 9: Timeout
at com.aerospike.client.command.MultiCommand.parseGroup(MultiCommand.java:96) ~[com.aerospike.aerospike-client-3.2.2.jar:na]
at com.aerospike.client.command.MultiCommand.parseResult(MultiCommand.java:71) ~[com.aerospike.aerospike-client-3.2.2.jar:na]
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:57) ~[com.aerospike.aerospike-client-3.2.2.jar:na]
at com.aerospike.client.command.Executor$ExecutorThread.run(Executor.java:127) ~[com.aerospike.aerospike-client-3.2.2.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66-internal]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66-internal]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66-internal]
I think it is something related to batch reads, because the batch_index_timeout gets incremented on every timeout exception. My database has a lot of large records (512K-1M) that are retrieved in these batch get operations.
I have a 5 nodes cluster, all of them are 2 CPUs, 3.5GB RAM each. Bellow there is a graph of the moment when the reads per second drops, which is exactly when the error starts happening. Is this a problem in the client or in the server? Both seems to be running ok with low cpu, network and memory consumption.
Edit 1: After some tests, I noticed that when the servers have more memory available operates at a lower level of I/O. I think this is because operational system disk cache is working and avoiding disk operations at common accessed areas. This can clearly be seen in the graph bellow:
- I have 1 node with SSD, 3.5G RAM and storage configured to device /dev/sdb
- I have 5 nodes with HDD, 7.5G RAM and storage configured to file /data/aerospike.dat
In the beginning, the HDD nodes increase I/O, but after some time it goes down and operated normally. The SDD server, with few memory, is always reading from disk. When I increase load in the cluster, this has a lot of impact on the SDD, but low impact on the HDD servers. With this configuration, there is no more timeout events.
Edit 2:
I started I completely new test environment, with 3 aerospike servers: 6CPU, 21GB RAM, 200GB SSD each. Replication factor of 2. This configuration can handle a lot more requests than the configuration before (40000 TPS at the moment). But even this config gets to a point where I get a lot of timeouts. The odd thing is that only one of the two servers present timeout errors in the monitoring (err_tsvc_requests_timeout variable). The other difference is that batch_index_timeout is not happening anymore, only err_tsvc_requests_timeout, and only in one of the three servers. Bellow the graphs of the 3 servers. The one with timeout problem is aerospike-new2. The aerospike-new3 was added after some time the other two were already running.