
Using Paxos to synchronize a large file across nodes

Published 2019-07-13 16:09

Question:

I'm trying to use Paxos to maintain consensus between nodes on a file that is around 50MB in size, and constantly being modified at individual nodes. I'm running into issues of practicality. Requirements:

  1. Sync a 50MB+ file across hundreds of nodes
  2. Have changes to this file, which can be made from any node, and aren't likely to directly compete with each other, propagated across the network in a few seconds at most
  3. New nodes that join the network can within a few minutes (<1 hour) build up the entire file by following along with the Paxos messages

The problem I'm facing is that there doesn't seem to be a way to accomplish both goals 2 and 3.

Here are the options I've considered so far:

  • Sync the entire file each round — Completely impractical, Paxos rounds would take minutes
  • Sync only changes to the file — Reasonable for goals 1 and 2, but breaks goal 3, as new nodes would only be able to sync the entire file once every unit of state has been changed
  • Sync changes & a random piece of the file each round — I'm not sure if Paxos allows for this. Nodes would be able to verify the changes against their own (allowing for new changes), and would be able to verify the random piece of the file against said piece of their version, but is this practical?

I'm thinking the third option is best, but I'm not sure if Paxos allows this. The idea would be to limit the data exchanged each round to maybe 1500 bytes, and fill that 1500 bytes with changes to the file primarily. Most rounds, the file would be unchanged, and the rounds in which something changed would most likely be less than 100 bytes of altered state, so the other 1400 bytes could be filled with some piece of the file, which would allow new nodes to build up the entire file over time. Is this practical? Has this problem already been solved?
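
For illustration, here is a rough sketch of how a single round's payload could be packed under option 3. The 1500-byte budget comes from the question itself, while the names (`pack_round_payload`, the change format) are hypothetical:

```python
import json

MAX_PAYLOAD = 1500  # byte budget per Paxos round, per the question

def pack_round_payload(pending_changes, file_bytes, cursor):
    """Fill the round with pending changes first, then a slice of the file
    starting at `cursor`, so new nodes can rebuild the whole file over many rounds."""
    changes_blob = json.dumps(pending_changes).encode()
    room = max(MAX_PAYLOAD - len(changes_blob), 0)
    chunk = file_bytes[cursor:cursor + room]
    next_cursor = (cursor + len(chunk)) % max(len(file_bytes), 1)
    return {"changes": changes_blob, "chunk_offset": cursor, "chunk": chunk}, next_cursor
```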

Answer 1:

As peter mentioned in the comments, an eventually consistent approach is probably a better fit. Here's one such algorithm, based on the gossip protocol.

Every node has some version of the file. Every N seconds each node connects to another node and they swap version numbers. If one node is behind the other it downloads the file from the peer.

This converges remarkably quickly; I'd expect within 10-20 gossip rounds for 1000 nodes.
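
A minimal sketch of that version-swap gossip loop, assuming each node holds a monotonically increasing version number for its copy of the file (class and method names are illustrative):

```python
import random
import time

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers      # other GossipNode instances
        self.version = 0        # monotonically increasing version of the file
        self.file = b""

    def publish(self, new_contents):
        """Called locally when this node modifies the file."""
        self.version += 1
        self.file = new_contents

    def gossip_round(self):
        peer = random.choice(self.peers)
        # swap version numbers; whoever is behind pulls the newer file
        if peer.version > self.version:
            self.version, self.file = peer.version, peer.file
        elif self.version > peer.version:
            peer.version, peer.file = self.version, self.file

    def run(self, interval_seconds=1.0):
        while True:
            self.gossip_round()
            time.sleep(interval_seconds)
```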


Update

(Introducing raft or a messaging queue into the mix.)

Raft

From your comments it appears you've got a key-value store on your hands. You can think of it as a distributed state machine, in which you treat the update to each key as its own command. This is a great fit for a consensus protocol like paxos or raft (I favor raft nowadays because of the number of open-source implementations). What's more, these are often implemented to also act as an atomic broadcast system. In short, a few nodes act as the core decision makers and the rest of the nodes listen to the results.

(Of course, I don't know how your file is being updated; i.e. whether it is only updated on a master node and the rest are slaves.)
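
For illustration, here is a sketch of the apply side of such a key-value state machine. The consensus layer (raft or paxos) is assumed to deliver the same ordered log of commands to every node, and the command format shown is hypothetical:

```python
class KVStateMachine:
    def __init__(self):
        self.data = {}
        self.last_applied = 0   # index of the last log entry applied

    def apply(self, log_index, command):
        """command is e.g. {"op": "set", "key": "k", "value": "v"}."""
        if log_index <= self.last_applied:
            return  # already applied; consensus layers may redeliver on recovery
        op = command["op"]
        if op == "set":
            self.data[command["key"]] = command["value"]
        elif op == "delete":
            self.data.pop(command["key"], None)
        self.last_applied = log_index
```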

One major concern will be the fan-out to 1000s of nodes. For this, you'll probably want a hierarchical fan-out. There are various schemes to help with this; here are a few ideas. A) Have each node connect to two random peers and stream from the peer with the shorter path to the master node (this is the power of two choices); or B) choose the peer with the shorter route only with some probability p. C) Have each node connect to one random peer and, with some probability p, stream from that peer, else connect to that peer's upstream node instead. These probabilities are meant to produce an n-ary tree, which is a good balance between every node connecting directly to the master node and every node forming a linked list.
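
As a sketch of scheme A, each node could sample two random peers and stream from the one advertising the shorter hop count to the master (the peer-advertised hop counts are an assumption here):

```python
import random

def choose_upstream(peers):
    """peers: dict of peer_id -> advertised hop count to the master node."""
    a, b = random.sample(list(peers), 2)   # the two random choices
    return a if peers[a] <= peers[b] else b
```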

Messaging Queue

Now, paxos and raft provide some pretty strong guarantees. Specifically for this case, every update will be processed in order--across all keys. If you don't need that guarantee then you can architect a much simpler system.

Each key update could be broadcast to a distributed messaging queue (like SQS, RabbitMQ, etc.). Version each key update and only apply an update if its version is greater than the one you already have. This gives you a beautiful and fast eventually consistent system.
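
A sketch of the versioned, last-writer-wins apply step on the consumer side; the message format and the at-least-once, out-of-order delivery are assumptions about the queue:

```python
class VersionedStore:
    def __init__(self):
        self.values = {}      # key -> value
        self.versions = {}    # key -> highest version applied so far

    def apply(self, message):
        """message: {"key": ..., "value": ..., "version": ...} from the queue."""
        key, version = message["key"], message["version"]
        # ignore stale or duplicate deliveries; only newer versions win
        if version > self.versions.get(key, -1):
            self.values[key] = message["value"]
            self.versions[key] = version
```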

I'd go with this approach over using raft/paxos if the system allows for it.



Answer 2:

We can use paxos to replicate the transaction log of writes to the file as described here. When a new server with an empty file joins the cluster, it can request a snapshot from an up-to-date node. In the meantime the new node can also listen to and buffer the current updates. Once it has loaded the snapshot and then applied the later updates, it is fully up to date.
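
A rough sketch of that catch-up procedure for a new node, where `fetch_snapshot`, `apply_update`, and the log-index bookkeeping are assumptions rather than part of Paxos itself:

```python
class JoiningNode:
    def __init__(self):
        self.buffer = []          # (log_index, update) pairs received while syncing
        self.state = None
        self.applied_index = -1

    def on_live_update(self, log_index, update):
        """Buffer replicated-log entries that arrive before the snapshot is loaded."""
        self.buffer.append((log_index, update))

    def catch_up(self, fetch_snapshot, apply_update):
        # fetch_snapshot() -> (snapshot_index, snapshot_state), from an up-to-date peer
        snapshot_index, snapshot_state = fetch_snapshot()
        self.state = snapshot_state
        self.applied_index = snapshot_index
        # replay only the buffered updates newer than the snapshot
        for log_index, update in sorted(self.buffer, key=lambda pair: pair[0]):
            if log_index > self.applied_index:
                self.state = apply_update(self.state, update)
                self.applied_index = log_index
        self.buffer.clear()
```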

A snapshot can be a full copy of the file. Taking a full snapshot means blocking writes but not reads; that might not be too much of a performance overhead for a 50MB file on a RAID of SSDs.

A more efficient approach is to model the file as an immutable (purely functional) data structure with copy-on-write semantics. Rather than a flat file, we can model it as a persistent data structure of file chunks: for example, an immutable sorted map where the keys are file chunk numbers and the values are the chunks of the file. Writing to the file then means inserting one or more updated chunks into the map. With such a data structure, every write operation returns a new immutable map, and the new map shares unmodified file chunks with prior versions of the map. A snapshot of the file is then simply the version of the map at a particular point in time. Because the map is immutable, no locks need to be taken while visiting all of its chunks to transmit the full snapshot of the file state to a new server. Log-structured storage uses this technique.
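
A minimal sketch of the copy-on-write chunk map, using a plain Python dict of chunk references to stand in for the persistent sorted map (names and chunk size are illustrative):

```python
CHUNK_SIZE = 4096

def write(chunk_map, offset, data):
    """Return a new chunk map with `data` written at `offset`; the input map is untouched."""
    new_map = dict(chunk_map)                     # copies chunk references, not chunk bytes
    pos = 0
    while pos < len(data):
        index = (offset + pos) // CHUNK_SIZE
        within = (offset + pos) % CHUNK_SIZE
        old = bytearray(chunk_map.get(index, b"\x00" * CHUNK_SIZE))
        piece = data[pos:pos + CHUNK_SIZE - within]
        old[within:within + len(piece)] = piece
        new_map[index] = bytes(old)
        pos += len(piece)
    return new_map

empty = {}
v1 = write(empty, 0, b"hello")    # `empty` is still a valid, consistent snapshot
```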