I'm playing with distributed Erlang applications.
Configuration and ideas are taken from:
http://www.erlang.org/doc/pdf/otp-system-documentation.pdf, section 9.9, "Distributed Applications".
- We have 3 nodes: n1@a2-X201, n2@a2-X201, n3@a2-X201
- We have an application wd that does some useful job :)
Configuration files:
- wd1.config - for the first node:
[{kernel,
  [{distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n2@a2-X201', 'n3@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n1.log"}}]}].
- wd2.config for the second:
[{kernel,
  [{distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n1@a2-X201', 'n3@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n2.log"}}]}].
- wd3.config for node n3 looks similar.
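For completeness, here is a sketch of what wd3.config would look like, following the pattern of wd1.config and wd2.config (an assumption, not taken from the post: the mandatory nodes become n1 and n2, and the log goes to its own file):

```erlang
[{kernel,
  [%% Same distributed spec as on the other two nodes
   {distributed, [{wd, 5000, ['n1@a2-X201', {'n2@a2-X201', 'n3@a2-X201'}]}]},
   {sync_nodes_mandatory, ['n1@a2-X201', 'n2@a2-X201']},
   {sync_nodes_timeout, 5000}]},
 {sasl,
  [%% All reports go to this file
   {sasl_error_logger, {file, "/tmp/wd_n3.log"}}]}].
```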
Now start erlang in 3 separate terminals:
- erl -sname n1@a2-X201 -config wd1 -pa $WD_EBIN_PATH -boot start_sasl
- erl -sname n2@a2-X201 -config wd2 -pa $WD_EBIN_PATH -boot start_sasl
- erl -sname n3@a2-X201 -config wd3 -pa $WD_EBIN_PATH -boot start_sasl
Start the application on each of the Erlang nodes with application:start(wd).
(n1@a2-X201)1> application:start(wd).
=INFO REPORT==== 19-Jun-2011::15:42:51 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
ok
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>

(n3@a2-X201)1> application:start(wd).
ok
(n3@a2-X201)2>
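To double-check which node the distributed application actually runs on, application:which_applications/0 (standard OTP) lists only the applications running on the local node, so wd should show up in exactly one of the three shells:

```erlang
%% Run in each node's shell; wd appears only where it is running.
(n1@a2-X201)2> lists:keymember(wd, 1, application:which_applications()).
true
```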
At the moment everything is OK. As described in the Erlang documentation, the application is running at node n1@a2-X201.
Now kill node n1: the application is migrated (failed over) to n2:
(n2@a2-X201)2>
=INFO REPORT==== 19-Jun-2011::15:46:28 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
Continuing our game: kill node n2. Once more the system works fine; the application is now at node n3:
(n3@a2-X201)2>
=INFO REPORT==== 19-Jun-2011::15:48:18 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
Now restore nodes n1 and n2:
Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.8.1  (abort with ^G)
(n1@a2-X201)1>

Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1>
Nodes n1 and n2 are back.
It looks like I now have to restart the application manually:
- Let's do it at node n2 first:
(n2@a2-X201)1> application:start(wd).
- It looks like it hangs ...
- Now start it at n1:
(n1@a2-X201)1> application:start(wd).
=INFO REPORT==== 19-Jun-2011::15:55:43 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$"
ok
(n1@a2-X201)2>
It works. And node n2 has also returned ok:
Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>
At node n3 we see:
=INFO REPORT==== 19-Jun-2011::15:55:43 ===
    application: wd
    exited: stopped
    type: temporary
In general everything looks OK, as described in the documentation, except for the delay when starting the application at node n2. (That delay is presumably because application:start/1 for a distributed application does not return until the application controllers on the involved nodes have agreed on where it should run.)
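What node n3 logged is a takeover: the higher-priority node n1 came back, so wd was stopped on n3 and started on n1. An application callback can observe this through the start type it receives; the sketch below assumes the module and supervisor are named wd_app and wd_sup (names not taken from the post). Note that {failover, Node} is passed only when start_phases is defined for the application; otherwise a failover arrives as a plain normal start.

```erlang
%% Sketch of an application callback distinguishing start types
%% (wd_app/wd_sup are assumed names, not from the post).
-module(wd_app).
-behaviour(application).
-export([start/2, stop/1]).

start(normal, Args) ->
    %% Ordinary local start (also used for failover without start_phases).
    wd_sup:start_link(Args);
start({takeover, FromNode}, Args) ->
    %% A higher-priority node (e.g. n1) is taking the application over.
    error_logger:info_msg("takeover from ~p~n", [FromNode]),
    wd_sup:start_link(Args);
start({failover, FromNode}, Args) ->
    %% Only seen when start_phases is defined for the application.
    error_logger:info_msg("failover from ~p~n", [FromNode]),
    wd_sup:start_link(Args).

stop(_State) ->
    ok.
```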
Now kill node n1 once more:
(n1@a2-X201)2>
User switch command
 --> q
[a2@a2-X201 releases]$
Oops ... everything hangs. The application was not restarted at another node.
Actually, while writing this post I realized that sometimes everything is OK and sometimes I hit this problem.
Any ideas why there could be problems when restoring the "primary" node and then killing it one more time?