How much data per node in Cassandra cluster?

Posted 2020-02-14 02:24

Where are the limits of SSTable compaction (major and minor), and at what point does it become ineffective?

If a major compaction merges a couple of 500 GB SSTables and the resulting SSTable will be over 1 TB, is it still practical for a single node to "rewrite" such a big dataset?

On an HDD this can take about a day and requires roughly double the disk space, so are there any best practices for this?
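As a rough sanity check on the "double space, about a day" estimate, here is a minimal back-of-the-envelope sketch. The throughput figure and flow are assumptions (about 100 MB/s of sustained sequential HDD throughput, and a major compaction that reads all inputs and writes one merged output while the inputs still exist), not Cassandra defaults; real compactions are slower than this raw-I/O lower bound because of merging overhead and compaction throttling.

```python
# Back-of-the-envelope estimate for a major compaction on a single HDD.
# Assumed numbers (not Cassandra defaults): ~100 MB/s sustained sequential
# throughput; the merged output is written while the inputs still exist,
# so free space roughly equal to the input size is needed.

def estimate_major_compaction(input_gb: float, throughput_mb_s: float = 100.0):
    """Return (hours of raw I/O, GB of free space needed)."""
    bytes_moved_mb = 2 * input_gb * 1024        # read all inputs + write the output
    hours = bytes_moved_mb / throughput_mb_s / 3600
    return hours, input_gb

hours, headroom = estimate_major_compaction(1000)   # ~1 TB of SSTables
print(f"~{hours:.1f} h of raw I/O, ~{headroom:.0f} GB of free disk needed")
```

This prints roughly 5.7 hours of pure I/O for 1 TB, which is consistent with a full day of wall-clock time once merge CPU cost and throttling are added on top.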

1 Answer
Emotional °昔
answered 2020-02-14 03:10

1 TB is a reasonable limit on how much data a single node can handle, but in reality, a node is not at all limited by the size of the data, only the rate of operations.

A node might have only 80 GB of data on it, but if you absolutely pound it with random reads and it doesn't have a lot of RAM, it might not be able to serve those requests at a reasonable rate. Similarly, a node might have 10 TB of data, but if you rarely read from it, or if only a small portion of your data is hot (so it can be cached effectively), it will do just fine.

Compaction certainly is an issue to be aware of when you have a large amount of data on one node, but there are a few things to keep in mind:

First, the "biggest" compactions, ones where the result is a single huge SSTable, happen rarely, even more so as the amount of data on your node increases. (The number of minor compactions that must occur before a top-level compaction occurs grows exponentially by the number of top-level compactions you've already performed.)

Second, your node will still be able to handle requests; reads will just be slower.

Third, if your replication factor is above 1 and you aren't reading at consistency level ALL, other replicas will be able to respond quickly to read requests, so you shouldn't see a large difference in latency from a client perspective (see the driver example after these points).

Last, there are plans to improve the compaction strategy that may help with some larger data sets.
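To make the first point concrete, here is a minimal sketch of how rarely the biggest SSTable gets rewritten, assuming size-tiered behaviour with a hypothetical merge threshold of 4 (every four similar-sized SSTables are merged into one SSTable of the next tier); the flush size is also an assumed number, not read from any configuration:

```python
# How rarely does the biggest SSTable get rewritten under size-tiered
# compaction? Assumed parameters, not taken from a real cluster's config.

MIN_THRESHOLD = 4   # SSTables merged per compaction (assumed threshold)
FLUSH_GB = 0.1      # assumed size of one freshly flushed SSTable (~100 MB)

for tier in range(1, 8):
    flushes = MIN_THRESHOLD ** tier      # memtable flushes per tier-N rewrite
    sstable_gb = flushes * FLUSH_GB      # size of the SSTable produced at this tier
    print(f"tier {tier}: ~{sstable_gb:8.1f} GB SSTable, "
          f"rewritten once every {flushes} flushes")
```

Under these assumptions the ~1.6 TB SSTable at tier 7 is only rewritten once every ~16,000 flushes, which is why the largest compactions become exponentially rarer as the node fills up.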
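And to illustrate the third point, a hedged example with the DataStax Python driver: reading below consistency level ALL lets whichever replicas are not busy with a large compaction answer the read. The keyspace, table, and contact point below are hypothetical placeholders.

```python
# Reading at CL=ONE: the first replica to answer satisfies the read, so a
# node slowed down by compaction is simply outraced by its peers.
# Requires the DataStax Python driver (pip install cassandra-driver).

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

query = SimpleStatement(
    "SELECT * FROM my_table WHERE id = %s",   # placeholder table
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(query, ("some-key",)).one())

cluster.shutdown()
```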
