I was going through the DataStax documentation and found an interesting statement.
It claimed "Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound".
Can someone explain the basis for this claim, and what might cause this behavior in Cassandra?
Thanks.
For different workloads, Cassandra clusters can be CPU, memory, I/O or (occasionally) network bound. The claim in the documentation is that if you start a new cluster and make lots of inserts, the cluster will initially be CPU bound, but after a while it becomes bottlenecked on memory.
To process an insert, Cassandra needs to deserialize the messages from the clients, find which nodes should store the data, and send messages to those nodes. Those nodes then store the data in an in-memory data structure called a memtable.
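To make the "find which nodes should store the data" step concrete, here is a minimal sketch of replica selection on a token ring. It is an illustration only, not Cassandra's actual code: the `Ring` class, the `token` function, and the use of MD5 as a stand-in partitioner hash are all assumptions, and there is one token per node for simplicity.

```python
import bisect
import hashlib


def token(key: str) -> int:
    # Hypothetical partitioner: the first 8 bytes of an MD5 digest as an integer.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    """Toy token ring with one token per node (an assumption for brevity)."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Sorted (token, node) pairs represent positions on the ring.
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, key):
        # Find the first node whose token is >= the key's token,
        # then walk clockwise (wrapping around) to collect replicas.
        start = bisect.bisect_left(self.tokens, (token(key), ""))
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
print(ring.replicas("user:42"))
```

Every insert pays for this hashing and lookup plus message serialization, which is part of why the early phase tends to be CPU bound.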
This work is almost always CPU bound initially. However, as more data is inserted, the memtables grow large and are flushed to disk, and new (empty) memtables are created. The flushed memtables are stored in files known as SSTables. An ongoing background process called compaction merges SSTables together into progressively larger files.
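The flush and compaction cycle described above can be sketched with a toy model. This is a simplification under stated assumptions: the `Table` class and its `flush_threshold` (counted in rows rather than bytes) are hypothetical, SSTables are plain in-memory lists instead of files, and compaction merges everything at once rather than following a real compaction strategy.

```python
class Table:
    """Toy write path: a memtable that flushes to immutable 'SSTables'."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # hypothetical limit, in rows
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted list of (key, value)

    def insert(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable run and start a new one.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def compact(self):
        # Merge all SSTables; iterating oldest first lets newer values win.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [sorted(merged.items())] if merged else []


table = Table(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    table.insert(key, value)
print(len(table.sstables))  # two flushed SSTables
table.compact()
print(table.sstables)  # one merged SSTable, with the newer value for "a"
```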
There are a few reasons why more memory will help at this stage:
- If Cassandra is low on heap space, it will flush memtables when they are smaller. This creates more, smaller SSTables, so there is more work to compact them.
- If the workload involves overwrites or inserts to the same row at different times, it is much cheaper to do this if the row is still in a current memtable. If not, the overwrite or new column is stored in a new memtable, then flushed and merged during compaction. So again, less memory means more compaction work.
- Your OS uses memory to buffer reads and writes during compaction. If the OS has too little free memory for this, there will be extra disk I/O, slowing down memtable flushing and compaction.
- Inserts into Cassandra allocate lots of Java objects, which creates work for the garbage collector. If the heap is too small, inserts may be paused while GC runs to free some heap. (On the other hand, if the heap is too large, inserts may be paused for a few seconds during stop-the-world GC.)
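The first two points in the list above can be illustrated with a rough simulation. It is a sketch, not a benchmark: `simulate` and `memtable_capacity` are hypothetical names, capacity is counted in distinct rows rather than bytes, and the workload is a seeded random stream of writes over 1000 rows so that overwrites occur.

```python
import random


def simulate(keys, memtable_capacity):
    """Return (sstables_flushed, entries_written_to_disk) for a toy memtable."""
    memtable = {}
    sstables_flushed = 0
    entries_written = 0
    for value, key in enumerate(keys):
        memtable[key] = value  # an overwrite of a resident row costs no extra space
        if len(memtable) >= memtable_capacity:
            sstables_flushed += 1
            entries_written += len(memtable)
            memtable = {}
    if memtable:  # flush whatever is left at the end
        sstables_flushed += 1
        entries_written += len(memtable)
    return sstables_flushed, entries_written


rng = random.Random(42)
workload = [f"row{rng.randrange(1000)}" for _ in range(10_000)]

small_heap = simulate(workload, memtable_capacity=10)
large_heap = simulate(workload, memtable_capacity=500)
print("small memtable:", small_heap)
print("large memtable:", large_heap)
```

With the larger memtable, more overwrites are absorbed in memory, so fewer SSTables and fewer total entries reach disk, which leaves less for compaction to merge later.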
So inserts may become memory bound, but they could also become I/O bound. If there isn't enough I/O capacity to flush memtables fast enough, inserts will block once the memtable flush queue is full. So I think the claim could be stated a bit more accurately:
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory or I/O bound.
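To illustrate the I/O-bound case, here is a minimal sketch of back-pressure from a bounded flush queue. The queue size of 2 and the `try_enqueue_flush` helper are assumptions made for the example; real Cassandra settings and internals differ.

```python
import queue

# Hypothetical limit on memtables waiting to be written to disk.
flush_queue = queue.Queue(maxsize=2)


def try_enqueue_flush(memtable):
    """Return True if the memtable was queued for flushing, False if the queue is full."""
    try:
        flush_queue.put_nowait(memtable)
        return True
    except queue.Full:
        # In a real system, writers would stall here until the disk catches up.
        return False


results = [try_enqueue_flush({"row": i}) for i in range(3)]
print(results)  # the third attempt fails because nothing has been flushed yet

flush_queue.get_nowait()  # simulate the disk finishing one flush
print(try_enqueue_flush({"row": 3}))  # there is room in the queue again
```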