I am implementing Paxos in a cluster simulator application, using the documentation available in Wikipedia. Unfortunately, it leaves several doors open to interpretation and does not provide much information on key implementation issues. It is unclear and incomplete.
- Assuming a cluster divided in 3 regions, each containing 3 nodes (total = 9 nodes). What happens if communication is broken between regions? There is no way any leader can reach quorum (which is 5).
Isn't Paxos going to enter an infinite loop? I guess one should not initiate Paxos if one cannot communicate with at least a quorum of nodes.
- In Phase 1b: 'If the proposal number N is larger than any previous proposal, then each Acceptor promises not to accept proposals less than N, and sends the value it last accepted for this instance to the Proposer'.
What is 'the last value it accepted'? Is it any previous proposal number from the proposer? What does 'instance' refer to exactly in this case?
In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?
In Phase 2a: 'If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N'.
What is value here? Is it the proposal number? I believe not, but this phrase is unclear.
In Phase 2a: 'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?
Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?
The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?
P.S.: I don't have enough reputation to create a 'Paxos' tag (any volunteer?)
I have found the following document explaining Paxos in more details. I have updated the wikipedia entry accordingly.
The answers to my question I could find are:
Paxos only works if at least a quorum of nodes can communicate with each other (in our case 5). Hence, if a node cannot communicate with at least a quorum of nodes, it should not try Paxos.
It is the last accepted proposition number and corresponding value.
It refers to the acceptor.
The value is not included in the Prepare message, it is left to the Accept Request message.
If acceptors have already accepted a proposal from the proposer, they can return the corresponding proposal number and value, else nothing.
The second question falls, since the Wikipedia entry was misleading. One can choose an arbitrary value for a given proposal or derive it from values corresponding to proposals accepted earlier.
Yes. A proposer p needs to number its proposals increasingly.
Nodes should keep their last accepted proposal number and eventually, the corresponding value too. They should persist it. When connecting for the first time, the initial proposal number for a given proposer should be null (or any equivalent).
Each proposer has stable storage. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.
The nomenclature in Paxos is a little unintuitive.
Paxos requires you can get at least a quorum (5 nodes in your case). Go with your solution of three regions; having two network partitions between the three regions is very bad news. I also use a version of Paxos which can change node membership from one instance to the next. This is useful for partitions and node failure.
A naive implementation of Paxos is not guaranteed to terminate because multiple nodes can leap-frog Prepare phases. There are two ways of getting around this. One is to have a random backoff before starting new Prepare phases. The second is to route all requests to a designated leader, that acts as proposer (The leader is chosen by a Paxos instance. See also Multi-paxos)
When a node receives an Accept!(roundId, value) message from a Proposer and it hasn't promised to not accept the value (due to a Prepare!(higherRoundId) message), it stores the value and the roundId (I'll call them
acceptedValue
andacceptedRoundId
). It may write over these due to subsequent Accept!(...) messages.When a node receives a Prepare!(roundId) message from a Proposer, it stores roundId as
promiseRoundId = max(roundId, promiseRoundId)
. It then sends aPromise!(acceptedRoundId, acceptedValue)
back to the Proposer. NB: if a node hasn't received an Accept!(...) message, it replies withPromise!(null, null)
.There is no need to send it. I don't.
The value is the actual data the algorithm is reaching consensus on. I'll rephrase this to
To start the Accept Phase, The Proposer must choose a value to be accepted depending on the results of the Prepare phase. If any Acceptor replied with Promise(roundId, value), the Proposer must use the value associated with the highest roundId. Otherwise, the Proposer received only Promise(null, null), and may choose any value to send to the acceptors.
NB: Proposal number here is the same thing as roundId.
This is the value you want to have consensus on. This is typically a state change across the distributed system, perhaps triggered by a client request.
Round ids (aka proposal numbers) should be increasing and must be unique per instance across all nodes. The Paxos paper assumes you can do this because it is trivial to achieve. Here's one scheme that produces the same results on all nodes:
roundId = i*M + index[node]
where i is the ith round this node is starting (that is i is unique per node per paxos instance, and is monotonically increasing).Or in pseudo-code (which is clearly lacking a few major optimizations):