When do I use a consensus algorithm like Paxos vs

Posted 2020-07-01 02:47

Question:

I've been reading a lot about different strategies to guarantee consistency between nodes in distributed systems, but I'm having a bit of trouble figuring out when to use which algorithm.

With what kind of system would I use something like a vector clock? Which system is ideal for using something like Paxos? Are the two mutually exclusive?

Answer 1:

Suppose you have a distributed system of 2 nodes that store data. The data is replicated to both nodes so that if one node dies, the data is not lost (durability) and continues to be served (availability). You also hope your 2-node system will handle twice as many requests per second (scalability).

Suppose the writes to a single key can come to any node. Your client writes "1" as the value for some key, then it decides to write "2". The first write goes to node#1, which issues a replication request to node#2. However, your request to store "2" reaches node#2 (we can store on any node, remember) earlier than the replication request does. Node#2 stores "2", issues a replication request with "2" to node#1, receives the replication request with "1" from it, and changes its "2" to "1", while node#1 changes its "1" to "2". Now your data is inconsistent between the storage nodes. Also, if node#1 dies, all you have is node#2 with the value "1", while you remember very well that you sent "2" after "1" and that the storage system confirmed it had saved it.

Actually, many things might go "wrong", depending on what you expect from your storage system (read your writes? monotonic reads? etc.), so you need a way to find out what the true, good, actual value for the key is, or even to prevent the system from "corrupting" data in this way. For that, the storage system needs to know what happened before what, either between its nodes, or it might even take your client's view of the order of events into consideration. Vector clocks and version vectors are some of the techniques used in practice to establish that order, or to determine that 2 events happened concurrently, in which case you need some other way to decide between their results.
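To make the "what happened before what" check concrete, here is a minimal sketch of comparing two version vectors. The dict-of-counters representation and the names `compare` and `merge` are assumptions made for illustration, not the API of any particular storage system.

```python
from typing import Dict

# Hypothetical representation: node name -> number of writes that node
# has accepted for this key.
VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return how version a relates to version b."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither saw the other: a real conflict
    if a_ahead:
        return "after"       # a has seen everything b has, and more
    if b_ahead:
        return "before"
    return "equal"


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Element-wise maximum, used once a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


# The race from the text: "1" accepted by node1, "2" accepted by node2,
# with neither write carrying knowledge of the other.
write_1 = {"node1": 1}
write_2 = {"node2": 1}
print(compare(write_1, write_2))  # concurrent

# If the client had sent "2" together with the version it last saw,
# node2 could record that "2" descends from "1".
write_2_with_context = {"node1": 1, "node2": 1}
print(compare(write_2_with_context, write_1))  # after
```

In the "concurrent" case the vectors alone cannot pick a winner; the system has to fall back on something else, such as keeping both values for the client to resolve.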

You decide to tackle the problem in a different way in order to avoid all these complexities: all writes for a certain key will go to one node (called the "leader"), and it will replicate these writes onto the other node. Indeed, that looks like a simpler scheme: within one node (and likely one process) you have fast and proven concurrency control techniques, can order events easily, and can apply replication in the same order; also, there's always an authoritative source of the right data. The only problem is that your 2 storage nodes need to agree which node is the leader for a particular key. And if you had 3 nodes and one of them died, the other 2 would need to decide 1) that they both think the old leader died, and 2) which one of them is the new leader. For that, consensus protocols exist (Paxos, Raft, Zab etc.). 2-phase commit and 3-phase commit are often mentioned alongside them, although strictly speaking they are atomic commit protocols that solve a related agreement problem.
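The 3-node scenario above can be sketched as a toy majority vote. This is only an illustration of why a majority is needed, assuming reliable in-process calls; it is not Paxos or Raft, and the names `Node`, `request_vote` and `elect` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    # term number -> candidate this node voted for in that term
    voted_in_term: Dict[int, str] = field(default_factory=dict)

    def request_vote(self, term: int, candidate: str) -> bool:
        # Grant at most one vote per term; repeating the same request is fine.
        granted_to = self.voted_in_term.setdefault(term, candidate)
        return granted_to == candidate


def elect(candidate: str, term: int, reachable: List[Node],
          cluster_size: int) -> bool:
    """The candidate wins only with votes from a strict majority of the
    whole cluster, counting dead or unreachable nodes in cluster_size."""
    votes = sum(1 for node in reachable if node.request_vote(term, candidate))
    return votes > cluster_size // 2


# A 3-node cluster where node1 (the old leader) has died.
node2, node3 = Node("node2"), Node("node3")
survivors = [node2, node3]

print(elect("node2", 1, survivors, 3))  # True: 2 of 3 votes
print(elect("node3", 1, survivors, 3))  # False: both already voted in term 1
```

Because each node votes once per term and a win needs more than half of the full cluster, two different leaders cannot both be elected in the same term. Real protocols add timeouts, retries, and log comparison on top of this idea.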

Why not always choose a single leader (and hence a consensus protocol) over a leaderless scheme (and hence an ordering mechanism like version vectors)? Negotiating leadership takes time (think up to seconds or tens of seconds), during which your system is unavailable or only partially available in some special mode. Leaderless can also perform better under some other conditions (e.g. when the leader becomes slow due to software or network problems: with a leaderless approach, other nodes might take over its duties). Consensus becomes harder as the number of participants increases, so leaderless can potentially scale better.

Finally, let's gallop through your questions literally:

With what kind of system would I use something like a vector clock?

You might want to use a version vector for leaderless distributed storage. You might use vector clocks for the same purpose (although they are a worse fit; the article also suggests you use them for consistent snapshots, for implementing causal ordering in general distributed systems etc.).

Which system is ideal for using something like Paxos?

Single-leader or multi-leader distributed storage. A database of rarely updated data (think configs) or cluster participation info, if this information is critical; otherwise gossip scales better. Distributed locks.

Are the two mutually exclusive?

No. Both can be used for solving the same tasks (e.g. distributed storage). They can also be combined: for example, use Paxos for cluster participation, and then use that knowledge to determine which nodes form a quorum in a system that is eventually consistent through version vectors.