I'm considering how to design a mechanism for replicating a (potentially large) MongoDB or other NoSQL (CouchDB, etc) database to dozens of clients at once. The clients would function like a replica set, but the replication would be one-way and the remote clients would belong to other parties. Specifically, I am looking for the following features:
- real-time: changes to the master database should be pushed out to the clients as quickly as possible
- replication to new clients: a new client must be able to connect, automatically sync the majority of existing data, then receive real-time updates.
- efficient: both the initial synchronization/transfer of data and tracking of real-time updates ("diffs", if you will) are computationally efficient, with multiple clients connected.
- secure: the master database presents an interface to which remote clients (who do not belong to the same owner or system) can connect: i.e., we cannot just add all the clients to the master's replica set.
- robust: a temporarily connection failure between a client and the master database should be easily and efficiently recoverable.
In some sense, the server is publishing a collection of data and the clients are subscribing to it. I realize that this is a hard software engineering problem, and to my knowledge no piece of software has implemented this exactly yet. However, some approaches have come to mind as close, which I'll list below.
Meteor's DDP protocol: It's designed to do this with Mongo-like collections and exactly implements the model of publishing and subscribing to a set of data (rather than a stream of messages). It manages the initial sync and sends along live changes. However, it's still in development, and far from being an industrial-strength solutions - current drawbacks are that the server keeps a copy of every client's state in a possibly inefficient way and is only tested on collections that can fit in the memory of a web app. Also, it appears that DDP cannot efficiently sync an out-of-date database without fetching everything from scratch. If anyone can point to some examples of how large of a collection can be synced over DDP, that would be great. (See also: https://stackoverflow.com/q/10128430/586086)
Broadcasting the Mongo oplog: Using a high-throughput message bus like Apache Kafka, one may be able to efficiently send the oplog to many clients at once. This tackles some of the system implementation challenges. However, this requires that the clients start with an initial sync that gets them close enough to the current master state somehow and then start replaying the oplog from the appropriate point.
Continuous replication a la CouchDB: I'm not sure how this is implemented and how robust it is, given the sparsity of the documentation. However, it does seem to work over remote database connections. How efficient is this, though, when multiple clients are trying to replicate at the same time? (A similar hack to this would be to make the clients MongoDB Priority 0 replica set members; however, that seems to be far from its intended use. See also: http://guide.couchdb.org/draft/replication.html)
Please give pointers to software or pieces of software that already implement parts of this, or suggestions on the algorithms/data structures needed to do this efficiently.
If you are looking specifically for real-time replication, I'd recommend you look into SaaS offerings specifically for this purpose, such as https://www.firebase.com/