Is it possible to recover from a network partition in an mnesia cluster without restarting any of the nodes involved? If so, how does one go about it?
I'm interested specifically in knowing:
- How this can be done with the standard OTP mnesia (v4.4.7)
- What custom code if any one needs to write to make this happen (e.g. subscribe to mnesia running_paritioned_network events, determine a new master, merge records from non-master to master, force load table from the new master, clear running parititioned network event -- example code would be greatly appreciated).
- Or, that mnesia categorically does not support online recovery and requires that the node(s) that are part of the non-master partition be restarted.
While I appreciate the pointers to general distributed systems theory, in this question I am interested in erlang/OTP mnesia only.
Sara's answer is great, even look at article about CAP. Mnesia developers sacrifice P for CA. If you need P, then you should choice what of CAP you want sacrifice and than choice another storage. For example CouchDB (sacrifice C) or Scalaris (sacrifice A).
It works like this. Imagine the sky full of birds. Take pictures until you got all the birds. Place the pictures on the table. Map pictures over each other. So you see every bird one time. Do you se every bird? Ok. Then you know, at that time. The system was stable. Record what all the birds sounds like(messages) and take some more pictures. Then repeat.
If you have a node split. Go back to the latest common stable snapshot. And try** to replay what append after that. :)
It's better described in "Distributed Snapshots: Determining Global States of Distributed Systems" K. MANI CHANDY and LESLIE LAMPORT
** I think there are a problem deciding who's clock to go after when trying to replay what happend
After some experimentation I've discovered the following:
force_load_table
after the network is partitioned.So to answer my question, one can perform semi online recovery by executing
mnesia:stop(), mnesia:start()
on the nodes in the partition whose data you decide to discard (which I'll call the losing partition). Executing themnesia:start()
call will cause the node to contact the nodes on the other side of the partition. If you have more than one node in the losing partition, you may want to set the master nodes for table loading to nodes in the winning partition - otherwise I think there is a chance it will load tables from another node in the losing partition and thus return to the partitioned network state.Unfortunately mnesia provides no support for merging/reconciling table contents during the startup table load phase, nor does it provide for going back into the table load phase once started.
A merge phase would be suitable for ejabberd in particular as the node would still have user connections and thus know which user records it owns/should be the most up-to-date for (assuming one user conneciton per cluster). If a merge phase existed, the node could filter userdata tables, save all records for connected users, load tables as per usual and then write the saved records back to the mnesia cluster.