I am running a write-heavy program (10 threads peaks at 25K/sec writes) on a 24 node Cassandra 3.5 cluster on AWS EC2 (each host is of c4.2xlarge type: 8 vcore and 15G ram)
Every once in a while my Java client, using DataStax driver 3.0.2, would get write timeout issue:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error happens infrequently and in a very unpredictable way. So far, I am not able to link the failures to anything specific (e.g. program running time, data size on disk, time of the day, indicators of system load such as CPU, memory, network metrics) Nonetheless, it is really disrupting our operations.
I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the leads out there, such as
- Changing "write_request_timeout_in_ms" in "cassandra.yaml" (already changed to 5 seconds)
- Using proper "RetryPolicy" to keep the session going (already using DowngradingConsistencyRetryPolicy on a ONE session level consistency level)
- Changing cache size, heap size, etc. - never tried those b/c there are good reasons to discount them as the root cause.
One thing is really confusing during my research is that I am getting this error from a fully replicated cluster with very few ClientRequest.timeout.write events:
- I have a fully-replicated 24 node cluster spans 5 aws regions. Each region has at least 2 copies of the data
- My program runs consistency level ONE at Session level (Cluster builder with QueryOption)
- When the error happened, our Graphite chart registered no more than three (3) host hiccups, i.e. having the Cassandra.ClientRequest.Write.Timeouts.Count values
- I already set write_timeout to 5 seconds. The network is pretty fast (using iperf3 to verify) and stable
On paper, the situation should be well within Cassandra's failsafe range. But why my program still failed? Are the numbers not what they appear to be?