How does Cassandra scale horizontally?

Published 2020-08-21 02:29

Question:

I've watched a video on the Cassandra database, which turned out to be very effective and really explains a lot about Cassandra. I've also read some articles and books about Cassandra, but the thing I could not understand is how Cassandra scales horizontally. By horizontal scaling I mean adding more nodes to gain more space. As I understand it, each node has identical data, i.e. if one node has 1TB of data and it is replicated to other nodes, all n nodes will each contain 1TB of data. Am I missing something here?

Answer 1:

Yes, you are missing something. Data may not need to be duplicated n times, where n is the number of nodes. You would typically configure your replication factor (RF) to be lower than the number of nodes (N).

For example, RF = 3, N = 5. Each row will be stored on 3 of the 5 nodes (chosen by token assignment, not at random). The replication factor counts all copies; there is no extra "pristine" original beyond the 3 replicas. If one node goes down, at least 2 copies of its data remain on the other nodes.

This works better in larger clusters, e.g. RF = 5, N = 100.

Higher RF improves data redundancy and read throughput, but decreases your write throughput, so there is a balance. If your RF is very high, say RF = N, you'd have very high data redundancy, high resilience to node failures, and high read throughput. On the other hand, your write throughput will be very limited, as every write must be replicated to all nodes. If a node goes down in this scenario, writes may fail (depending on the consistency level the client requests), as the desired number of replicas cannot be written.
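The availability trade-off above can be sketched with a little arithmetic (a hand-rolled illustration, not driver or server code): for a single row with RF replicas, a write succeeds as long as the consistency level's required number of replicas is still up.

```python
# Illustration only: how many replica failures each write consistency
# level tolerates, for a single row stored on RF nodes.
def tolerated_failures(rf: int, level: str) -> int:
    required = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]
    return rf - required

for level in ("ONE", "QUORUM", "ALL"):
    print(level, tolerated_failures(3, level))  # ONE 2, QUORUM 1, ALL 0

# At the extreme RF = N with consistency ALL, a single down node
# fails the write, which is the limitation described above:
print(tolerated_failures(5, "ALL"))  # 0
```

This is why RF = N paired with strict consistency trades away write availability: every node becomes a single point of failure for writes.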



Answer 2:

The number of replicas (i.e. the identical data) you want to store for each partition (row/piece of data) is configurable. So, if you have n nodes, you could in theory set the database to replicate each partition n times. Then, horizontal scaling would not occur if you add more nodes. However, if you set the number of replicas to 1 or 2, you have more space per node to store data horizontally. New data can then go into new nodes. Keep in mind though, that with less replicas you have a greater chance of losing data if any set of nodes go down at a particular time.
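The space trade-off this answer describes fits in a one-line formula (a back-of-the-envelope sketch, not anything Cassandra computes for you): with N nodes of a given disk size and RF copies of every partition, the cluster can hold about N × disk / RF of unique data.

```python
# Illustration: unique (deduplicated) data a cluster can hold,
# assuming evenly balanced nodes of the same disk size.
def unique_capacity_gb(n_nodes: int, disk_gb: float, rf: int) -> float:
    return n_nodes * disk_gb / rf

print(unique_capacity_gb(5, 1000, 1))  # 5000.0 - no redundancy
print(unique_capacity_gb(5, 1000, 2))  # 2500.0
print(unique_capacity_gb(5, 1000, 5))  # 1000.0 - every node holds everything
```

The last line shows the case from the question: when RF equals the node count, adding nodes adds redundancy but no new space; with RF fixed below N, each new node grows capacity.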



Answer 3:

Yes, a lot.

Replication depends on the replication factor set for the keyspace. If the replication factor is 2, two copies of each row are stored. In a 20-node cluster this does not mean a couple of nodes hold everything: each node owns a share of the token range, so the data is spread across all 20 nodes, with each individual row living on 2 of them.

Data is divided among nodes based on the partition key (the first part of the primary key), not the clustering key. All rows sharing the same partition key are stored together on the same replica nodes; clustering columns only order rows within that partition. This ensures a query for a single partition need only hit one node.



Answer 4:

As I understand it, each node has identical data, i.e. if one node has 1TB of data and it is replicated to other nodes, all n nodes will each contain 1TB of data. Am I missing something here?

Yes, not all nodes are necessarily copies of each other. Depending on the level of availability I want to support, I can set my replication factor lower than the total number of nodes.

Let's say that I have a 2 node cluster with a replication factor of 2. So in this case, each node does have a complete copy of the data. If I am running out of disk, I can alleviate some of that by adding a new node while keeping my replication factor set at 2 (3 nodes, RF of 2).

In this way if each disk has 1TB of storage, and I'm at 900GB on each, adding a new node (while keeping my RF the same) makes each node responsible for only 2/3 of the data. So in this case, each node would hold 600GB of data (freeing up 300GB on my 2 existing nodes). And thus, I have increased my disk capacity by scaling horizontally.
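The figures above can be checked with simple arithmetic (an illustration; per-node loads are only this even when token ranges are balanced): each node stores roughly unique_data × RF / N.

```python
# Illustration: approximate data per node, assuming evenly
# balanced token ranges across the cluster.
def per_node_gb(unique_gb: float, rf: int, n_nodes: int) -> float:
    return unique_gb * rf / n_nodes

print(per_node_gb(900, 2, 2))  # 900.0 - two nodes, each a full copy
print(per_node_gb(900, 2, 3))  # 600.0 - third node added, 300 GB freed per old node
```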

The catch is that even though I have 3 nodes, I can really only afford to lose one of them. If I lose two nodes, some partitions have no surviving replica, and I can no longer serve queries for part of my data.



Answer 5:

A replication factor of 3 means that there will be three copies of the data within a cluster. The replication factor also determines how many nodes must respond when using quorum reads/writes. A quorum read/write means that the query must be acknowledged by RF/2 + 1 nodes. Given an RF of 3, that is two nodes (decimals are always rounded down). If you always do quorum reads and writes, you will always get consistent responses, because any read quorum overlaps any write quorum in at least one node that holds the latest data.

From the book Practical Cassandra. In the formula RF/2 + 1, RF is the replication factor: the number of copies of the keyspace's data.
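The formula can be checked directly, with Python's floor division making the "decimals are rounded down" rule explicit (a plain calculation, not driver code):

```python
# Quorum size for a given replication factor: floor(RF/2) + 1.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in range(1, 6):
    print(rf, quorum(rf))  # 1->1, 2->2, 3->2, 4->3, 5->3
```

Note that RF = 3 and RF = 4 tolerate the same single replica failure at quorum, which is one reason odd replication factors are common.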



Answer 6:

I think the missing piece is understanding tokenization. The cluster has a range of tokens; that range is divided up across the nodes for ownership. When data is inserted, its partition key is hashed to a token, which determines its placement in the cluster. (Note: that is the primary replica's placement; with RF=3, the data would also exist in two other places in the cluster.)

Therefore, if you have 9 nodes, your token range is divided into 9 sections, and each row is placed on the node that owns its token. If your cluster has 90 nodes, the token range is divided into 90 sections, and the data gets assigned and placed across all 90 nodes.
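Token-based placement can be sketched as a tiny consistent-hashing ring (an illustration only: the hash, ring size, and node names here are made up; Cassandra actually uses the Murmur3 partitioner over a much larger token range):

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Cassandra's partitioner: any stable hash works to illustrate.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000

def replicas(key, ring, rf):
    """ring: sorted (token, node) pairs. The replica set is the node whose
    range covers the key's token plus the next rf-1 distinct nodes clockwise."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect_right(tokens, token(key)) % len(ring)
    out = []
    while len(out) < rf:
        node = ring[i % len(ring)][1]
        if node not in out:
            out.append(node)
        i += 1
    return out

# A 4-node ring with evenly spaced (hypothetical) tokens:
ring = sorted((t, f"node{j}") for j, t in enumerate([0, 250, 500, 750]))
print(replicas("user:42", ring, 3))
```

Adding a node means adding its tokens to the ring, which shrinks the range (and therefore the data share) owned by the existing nodes; that is the mechanism behind the horizontal scaling described in the earlier answers.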

Understanding tokens and placement is critical, and should not be confused with topology.