
High Availability in Cassandra

Published 2020-04-28 16:28

Question:

1) I have a 5-node cluster (172.30.56.60, 172.30.56.61, 172.30.56.62, 172.30.56.63, 172.30.56.129)

2) I created a keyspace with a replication factor of 3 and write consistency 3, and I inserted a row into a table with partition key '1' like below:

INSERT INTO user (user_id, user_name, user_phone) VALUES(1,'ram', 9003934069);
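
(For reference - the question does not show the schema, so this reconstruction assumes the column types and the replication strategy:)

CREATE KEYSPACE keyspacetest
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE keyspacetest.user (
    user_id    int PRIMARY KEY,   -- partition key; the '1' in the INSERT above
    user_name  text,
    user_phone bigint             -- bigint: 9003934069 does not fit in an int
);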

3) I verified the location of the data using the nodetool getendpoints utility and observed that the data is stored on three nodes: 60, 129 and 62.

./nodetool getendpoints keyspacetest user 1
172.30.56.60
172.30.56.129
172.30.56.62

4) Now if I bring down node 60, shouldn't Cassandra transfer the existing data ('1', 'ram', 9003934069) to one of the remaining nodes (61 or 63) to maintain the RF of 3?

But Cassandra does not do that, so does that mean that if nodes 60, 129 and 62 are down I will not be able to read / write any data under partition '1' in the table 'user'?

Ques 1: So even if I have a 5-node cluster, if the nodes where a partition resides go down, is the cluster useless?

Ques 2: If two nodes are down (example: 60 and 129 are down) while 61, 62 and 63 are still up and running, I am not able to write any data to partition '1' with write consistency = 3. Why is that? Whereas I am able to write the data with write consistency = 1 - so this again says the data for a partition will be available only on the predefined nodes in the cluster, with no possibility of repartitioning?

If any part of my question is not clear, please let me know and I will clarify it.

Answer 1:

4) Now if I bring down node 60, shouldn't Cassandra transfer the existing data ('1', 'ram', 9003934069) to one of the remaining nodes (61 or 63) to maintain the RF of 3?

That is not the way Cassandra works - the replication factor 'only' declares how many copies of your data Cassandra stores on disk on different nodes. Cassandra mathematically forms a ring out of your nodes. Each node is responsible for a range of so-called tokens (which are basically a hash of your partition key components). Your replication factor of three means that data will be stored on the node owning your data's token and on the next two nodes in the ring.
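
You can see which token your partition key hashes to directly in cqlsh (a minimal sketch, assuming the keyspace/table from the question; the output can be compared against the token ranges printed by nodetool ring):

SELECT token(user_id) FROM keyspacetest.user WHERE user_id = 1;
-- returns the token (Murmur3 by default) for partition '1'; the node owning
-- the matching token range and the next two nodes in the ring hold the replicas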

(a quickly googled illustration: https://image.slidesharecdn.com/cassandratraining-161104131405/95/cassandra-training-19-638.jpg?cb=1478265472)

Changing the ring topology is quite complex and not done automatically at all.


Ques 1: So even if I have a 5-node cluster, if the nodes where a partition resides go down, is the cluster useless?

No. On the other hand there is the consistency level - there you define how many nodes must acknowledge your write or read request before it is considered successful. Above you chose CL=3 with RF=3 - that means all nodes holding replicas have to respond, so all of them need to be online. If a single one is down your requests will always fail (if your cluster were bigger, say 6 nodes with 3 of them down, chances are that the three online nodes are the 'right' ones for some writes, which would then still succeed).
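
You can reproduce that in cqlsh (a sketch; the exact error text depends on your driver and version):

CONSISTENCY THREE;
INSERT INTO keyspacetest.user (user_id, user_name, user_phone)
VALUES (1, 'ram', 9003934069);
-- fails with an Unavailable error while one of the three replicas is down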

But Cassandra has tuneable consistency (see the docs at http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html).

You could pick QUORUM, for example. Then (replication_factor / 2) + 1 nodes (integer division) are needed for queries - in your case (3 / 2) + 1 = 1 + 1 = 2 nodes. QUORUM is perfect if you really need consistent data, as at least one node participating in your request will overlap between write and read and will have the latest data. Now one node can be down and everything will still work.
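
Again as a cqlsh sketch:

CONSISTENCY QUORUM;
SELECT * FROM keyspacetest.user WHERE user_id = 1;
-- succeeds as long as any 2 of the 3 replicas (60, 129, 62) are online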

BUT:

Ques 2: If two nodes are down (example: 60 and 129 are down) while 61, 62 and 63 are still up and running, I am not able to write any data to partition '1' with write consistency = 3. Why is that? Whereas I am able to write the data with write consistency = 1 - so this again says the data for a partition will be available only on the predefined nodes in the cluster, with no possibility of repartitioning?

Look above - that is the explanation. A write with CL=1 will succeed because one replica (62) is still online and you require only one node to acknowledge your write.
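
The same experiment with only replica 62 left online (sketch):

CONSISTENCY ONE;
INSERT INTO keyspacetest.user (user_id, user_name, user_phone)
VALUES (1, 'ram', 9003934069);
-- succeeds: a single replica acknowledgement is enough at CL=1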

Of course the replication factor is not useless at all. Writes are replicated to all available replicas even if a lower consistency level is chosen - you just don't have to wait for all of them on the client side. If a node is down for a short period of time (default: 3 hours) the coordinator will store the missed writes as hints and replay them when the node comes back up, so your data is fully replicated again.
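
That hint window is configurable in cassandra.yaml (sketch; this is the pre-4.0 setting name, so check the config of your version):

# cassandra.yaml
max_hint_window_in_ms: 10800000   # 3 hours - how long hints are kept for a dead node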

If a node is down for a longer period of time it is necessary to run nodetool repair and let the cluster rebuild a consistent state. That should be done on a regular schedule anyway as a maintenance task to keep your cluster healthy - there can be missed writes because of network/load issues, and there are tombstones from deletes, which can be a pain.
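
For example (sketch; -pr repairs only the node's primary token ranges, so run it on every node in turn):

./nodetool repair -pr keyspacetest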

And you can remove nodes from or add nodes to your cluster (when adding, just add one at a time) and Cassandra will repartition your ring for you.

In the case of removal, an online node can stream the data on it to the others; an offline node can also be removed, but the data on it will then lack sufficient replicas, so a nodetool repair must be run afterwards.

Adding nodes will assign new token ranges to the new node and automatically stream data to it. But existing data is not deleted from the source nodes for you (this keeps you safe), so after adding nodes nodetool cleanup is your friend.
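
A minimal sketch of those maintenance commands (keyspace name taken from the question; the host ID comes from nodetool status):

# remove a node that is still online - it streams its data to the others:
./nodetool decommission

# remove a node that is already dead:
./nodetool removenode <host-id>

# after new nodes have joined, reclaim space on the pre-existing nodes:
./nodetool cleanup keyspacetest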

Cassandra chooses A(vailability) and P(artition tolerance) from the CAP theorem (see https://en.wikipedia.org/wiki/CAP_theorem). So you can't have full consistency at all times - but QUORUM will often be more than enough.

Keep your nodes monitored and don't be too afraid of node failure - it simply happens all the time: disks die, networks drop. Design your applications for it.

Update: It's up to you to choose how much failure your cluster should survive before you start losing data or failing queries. If needed you can go with higher replication factors (RF=7 with CL QUORUM needs (7 / 2) + 1 = 4 replicas, so it tolerates the loss of 3) and/or even with multiple datacenters in different locations in case you lose an entire one (which happens in real life - think of network loss).


For the comment below regarding https://www.ecyrd.com/cassandracalculator/:

Cluster size: 3
Replication factor: 2
Write level: 2
Read level: 1

Your reads are consistent: Sure - you request that writes be ack'd by all replicas, so any read will see the latest write.

You can survive the loss of no nodes without impacting the application: See above - RF=2 and WC=2 require that both replicas respond to every write. So for writes your application WILL be impacted; for reads, one node can be down.

You can survive the loss of 1 node without data loss: As data is written to 2 replicas and you only read from one, if one node is down you can still read from the other.

You are really reading from 1 node every time: RC=1 requests that your read be served by one replica - so the first one that acks the read will do; if one node is down that doesn't matter, as the other one can ack your read.

You are really writing to 2 nodes every time: WC=2 requests that every write be ack'd by two replicas - which is also the total number of replicas in your example. So all replicas need to be online when writing data.

Each node holds 67% of your data: Just some math ;) - with RF=2 every row is stored on 2 of the 3 nodes, so each node holds 2/3 ≈ 67% of the data.

With those settings you can't survive a node loss without impact while writing to your cluster. But your data is written to disk on two replicas - so if you lose one node you still have your data on the other and can recover it onto a replacement node.