I am currently looking into how backup/restore can be done in Cassandra. We've set up a three-node cluster in AWS. I understand that we can take a snapshot using the nodetool snapshot tool, but it's a bit of a cumbersome process.
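(For reference, the per-node flow I find cumbersome looks roughly like this; the tag and keyspace are placeholders:)

```python
import subprocess

# Take a named snapshot of one keyspace on this node.
subprocess.run(["nodetool", "snapshot", "-t", "backup-tag", "my_keyspace"], check=True)

# nodetool snapshot only creates hard links under each table's local
# .../snapshots/<tag>/ directory; you still have to locate those
# directories, copy them off the node, and repeat this on every node.
```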
My idea is:
Make use of EBS snapshots, because they're durable and easy to set up. The one problem I see with EBS is that the backup can be inconsistent. Hence, my plan is to run a script prior to taking the EBS snapshot that runs nodetool flush to write all the memtable data out to disk (as SSTables) and then creates hard links to the flushed SSTables.
Once that's done, initiate the EBS snapshot. This way we can address the inconsistency issue we might face if we used EBS snapshots alone.
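A minimal sketch of the script I have in mind, using boto3 for the snapshot call (the volume ID and tag are placeholders, and error handling is omitted):

```python
import subprocess
import boto3

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder: EBS volume holding the data directory
SNAPSHOT_TAG = "pre-ebs-backup"       # placeholder nodetool snapshot tag

# 1. Flush memtables so the SSTables on the volume are current.
subprocess.run(["nodetool", "flush"], check=True)

# 2. Take a nodetool snapshot so the flushed SSTables are hard-linked
#    and cannot be compacted away while the EBS snapshot is in flight.
subprocess.run(["nodetool", "snapshot", "-t", SNAPSHOT_TAG], check=True)

# 3. Initiate the EBS snapshot of the data volume.
ec2 = boto3.client("ec2")
resp = ec2.create_snapshot(
    VolumeId=VOLUME_ID,
    Description="Cassandra backup after nodetool flush",
)
print("Started EBS snapshot:", resp["SnapshotId"])
```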
Please let me know if you see any issues with this approach, or share your suggestions.
Being immutable, SSTables do help a lot when it comes to backups, indeed.
Your idea sounds OK for situations where everything is healthy in your cluster. Cassandra is consistency-configurable (if I say eventually consistent, some people may be offended here, hehe), and since the system itself may not be fully consistent at any given time, you cannot expect your backup to be either. On the other hand, one of the beauties of Cassandra (and NoSQL models) is that it tends to recover pretty well, which is true for Cassandra in most situations (quite the opposite of relational databases, which are very sensitive to data loss). It's very unlikely you'll end up with a bunch of useless data if you have at least fully preserved SSTable files.
Be aware that EBS snapshots are block-level, so the filesystem sitting on top of the volume can be a concern as well. Fortunately, modern filesystems have journaling and are pretty reliable, so that shouldn't be a problem; still, keeping your data on a separate partition is good practice, as it lowers the chances of something else writing to it right after a full flush.
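One way to narrow that window further (an assumption of mine, not something your plan requires) is to freeze the filesystem while the snapshot is initiated; EBS captures its point in time as soon as create_snapshot returns, so the freeze only needs to be brief. A sketch, with a placeholder mount point and volume ID (fsfreeze needs root):

```python
import subprocess
import boto3

DATA_MOUNT = "/var/lib/cassandra"     # placeholder: mount point of the dedicated data partition
VOLUME_ID = "vol-0123456789abcdef0"   # placeholder

ec2 = boto3.client("ec2")

# Block all writes to the data partition so nothing lands between
# the flush and the start of the snapshot.
subprocess.run(["fsfreeze", "--freeze", DATA_MOUNT], check=True)
try:
    # The copy to S3 continues in the background after this returns.
    ec2.create_snapshot(VolumeId=VOLUME_ID, Description="frozen-fs Cassandra snapshot")
finally:
    # Unfreeze immediately; Cassandra writes block while frozen.
    subprocess.run(["fsfreeze", "--unfreeze", DATA_MOUNT], check=True)
```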
You may have some lost replicas when you eventually need to restore your cluster, which will require you to run nodetool repair, and, if you have done that before, you know it is a bit painful and takes very long for large amounts of data. (But repair is recommended to be run regularly anyway, especially if you delete a lot.)
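If you do end up repairing after a restore, here is a sketch of kicking off a primary-range repair node by node (the addresses are placeholders; this assumes SSH access and nodetool on each node's PATH):

```python
import subprocess

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder node addresses

for node in NODES:
    # -pr repairs only each node's primary ranges, so running it
    # sequentially on every node covers the whole ring exactly once.
    subprocess.run(["ssh", node, "nodetool", "repair", "-pr"], check=True)
```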
Another thing to consider is hinted handoffs (writes whose row owners are unavailable, kept by other nodes until the owners come back). I don't know what happens to them when you flush, but I'd guess they're kept in memory and in the commit logs only.
And, of course, do a full test restore before you assume this will work in the future.
I don't have a lot of experience with Cassandra, but the backup solutions I have heard about for it are whole-cluster replicas in another region or datacenter, rather than cold backups like snapshots. That's probably more expensive, but also more reliable than raw disk snapshots like the ones you're trying to take.
I am not sure how backing up a single node will help, because in C* the data is already replicated to the replica nodes.
If a node is dead and has to be replaced, the new node will learn which data it needs to own and stream it from the other nodes, so you might not need to restore from a disk backup at all.
Would a replication scenario like the following help?
Use two data centers (DC:A with three nodes, DC:B with one node) with a replication factor of A:2 and B:1. Let clients interact only with the nodes in DC:A, with a read/write consistency of LOCAL_QUORUM. Since quorum here is 2, all reads and writes will succeed within DC:A, and the data also gets replicated to DC:B. You could then back up DC:B.
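A sketch of the keyspace that scenario implies, via the DataStax Python driver (the contact point, keyspace, table, and datacenter names "A"/"B" are placeholders; the DC names must match your snitch configuration):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])  # placeholder: a node in DC:A
session = cluster.connect()

# RF 2 in DC:A (serving traffic), RF 1 in DC:B (the backup DC).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication =
    {'class': 'NetworkTopologyStrategy', 'A': 2, 'B': 1}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS my_keyspace.t (id int PRIMARY KEY, val text)"
)

# LOCAL_QUORUM on RF 2 needs both replicas in DC:A, but never waits
# on the remote DC:B replica, which receives the data asynchronously.
insert = SimpleStatement(
    "INSERT INTO my_keyspace.t (id, val) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, (1, "hello"))
```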