Erlang: Distributed Application Strange behaviour

Published: 2019-02-06 23:36

Question:

I'm playing with distributed Erlang applications.

Configuration and ideas are taken from:
http://www.erlang.org/doc/pdf/otp-system-documentation.pdf, section 9.9 "Distributed Applications"

  • We have 3 nodes: n1@a2-X201, n2@a2-X201, n3@a2-X201
  • We have an application wd that does some useful job :)

Configuration files:

  • wd1.config - for the first node:
      [{kernel,
          [{distributed,[{wd,5000,['n1@a2-X201',{'n2@a2-X201','n3@a2-X201'}]}]},
           {sync_nodes_mandatory,['n2@a2-X201','n3@a2-X201']},
           {sync_nodes_timeout,5000}
          ]}
      ,{sasl,
          [%% All reports go to this file
           {sasl_error_logger,{file,"/tmp/wd_n1.log"}}
          ]}
      ].
  • wd2.config for the second:
      [{kernel,
          [{distributed,[{wd,5000,['n1@a2-X201',{'n2@a2-X201','n3@a2-X201'}]}]},
           {sync_nodes_mandatory,['n1@a2-X201','n3@a2-X201']},
           {sync_nodes_timeout,5000}
          ]}
      ,{sasl,
          [%% All reports go to this file
           {sasl_error_logger,{file,"/tmp/wd_n2.log"}}
          ]}
      ].

  • The configuration for node n3 looks similar (a sketch follows).
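
For reference, a wd3.config along the same lines would presumably look as follows (my sketch, not taken from the original post; the mandatory nodes become n1 and n2, and the log file name /tmp/wd_n3.log is an assumption):

      [{kernel,
          [{distributed,[{wd,5000,['n1@a2-X201',{'n2@a2-X201','n3@a2-X201'}]}]},
           {sync_nodes_mandatory,['n1@a2-X201','n2@a2-X201']},
           {sync_nodes_timeout,5000}
          ]}
      ,{sasl,
          [%% All reports go to this file
           {sasl_error_logger,{file,"/tmp/wd_n3.log"}}
          ]}
      ].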

Now start Erlang in 3 separate terminals:

  • erl -sname n1@a2-X201 -config wd1 -pa $WD_EBIN_PATH -boot start_sasl
  • erl -sname n2@a2-X201 -config wd2 -pa $WD_EBIN_PATH -boot start_sasl
  • erl -sname n3@a2-X201 -config wd3 -pa $WD_EBIN_PATH -boot start_sasl

Start the application on each of the Erlang nodes with application:start(wd):

(n1@a2-X201)1> application:start(wd).

=INFO REPORT==== 19-Jun-2011::15:42:51 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$" 
ok
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2> 
(n3@a2-X201)1> application:start(wd).
ok
(n3@a2-X201)2> 

At the moment everything is OK. As described in the Erlang documentation, the application is running at node n1@a2-X201.
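
For reference, my reading of the distributed entry, based on the kernel documentation:

    %% {wd, 5000, ['n1@a2-X201', {'n2@a2-X201','n3@a2-X201'}]}
    %%   wd   - the application to distribute
    %%   5000 - milliseconds to wait before restarting the application
    %%          at another node when the current one goes down
    %%   the node list gives the priority: n1 first; n2 and n3 are
    %%   grouped in a tuple, so they have equal priority and either
    %%   may take over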

Now kill node n1: the application is migrated to n2.

(n2@a2-X201)2> 
=INFO REPORT==== 19-Jun-2011::15:46:28 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$" 

Continue our game: kill node n2. One more time the system works fine; we now have our application at node n3.

(n3@a2-X201)2> 
=INFO REPORT==== 19-Jun-2011::15:48:18 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$" 

Now restore nodes n1 and n2:

Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.1  (abort with ^G)
(n1@a2-X201)1> 

Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1> 

Nodes n1 and n2 are back.
It looks like I now have to restart the application manually. Let's do it at node n2 first:

(n2@a2-X201)1> application:start(wd).
  • Looks like it hangs ...
  • Now restart it at n1:
(n1@a2-X201)1> application:start(wd).

=INFO REPORT==== 19-Jun-2011::15:55:43 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\\.SIG$" 

ok
(n1@a2-X201)2> 

It works. And node n2 has also returned ok:

Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2> 

At node n3 we see:

=INFO REPORT==== 19-Jun-2011::15:55:43 ===
    application: wd
    exited: stopped
    type: temporary

In general everything looks OK, as described in the documentation, except for the delay in starting the application at node n2.
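
The report on n3 is the takeover in action: once the application is started again on the higher-priority node n1, the instance on n3 is stopped. The application callback module sees this as the start type passed to its start/2 callback. A minimal sketch (wd_app and wd_sup are assumed names; the real modules may differ):

    -module(wd_app).
    -behaviour(application).
    -export([start/2, stop/1]).

    %% StartType is normal, {takeover, FromNode} or {failover, FromNode};
    %% a distributed application can inspect it, e.g. to migrate state
    %% from FromNode before the old instance is stopped there.
    start(_StartType, _StartArgs) ->
        wd_sup:start_link().

    stop(_State) ->
        ok.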

Now kill node n1 once more:

(n1@a2-X201)2> 
User switch command
 --> q
[a2@a2-X201 releases]$ 

Oops ... everything hangs. The application was not restarted at another node.

Actually, while I was writing this post I've realized that sometimes everything is OK and sometimes I have a problem.

Any ideas why there could be problems when restoring the "primary" node and killing it one more time?

Answer 1:

As explained over at Learn You Some Erlang (scroll to the bottom), distributed applications only work well when started as part of a release, not when you start them manually with application:start.
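
For illustration, the release route would be roughly as follows (a sketch; the release name wd_rel and all version strings are placeholders to be matched against your installation). Write a .rel file, generate a boot script with systools, and boot each node with it instead of calling application:start/1 by hand:

    %% wd_rel.rel -- versions must match your installed OTP and app
    {release, {"wd_rel", "1.0"}, {erts, "5.8.1"},
     [{kernel, "2.14.1"},
      {stdlib, "1.17.1"},
      {sasl, "2.1.9.2"},
      {wd, "1.0"}]}.

    %% In an Erlang shell with wd's ebin in the code path:
    %% creates wd_rel.script and wd_rel.boot from wd_rel.rel
    systools:make_script("wd_rel", [local]).

Then start each node with, for example, erl -sname n1@a2-X201 -config wd1 -boot wd_rel, so the application is started by the boot script rather than manually.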



Answer 2:

The oddity you're seeing is likely due to restarting your application entirely on nodes n1/n2 while n3 is still running under the initial application initialisation.

If your application starts any system-wide processes and uses their pids rather than registered names (set with global, pg, or pg2, for example), then you may be working with two sets of global state.

If this is the case, the recommended approach is to focus on adding/removing nodes from an existing application rather than restarting the application in its entirety. This way nodes leave and join an existing set of initialised values.
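
For example, a cluster-wide singleton would typically be addressed by a global name rather than a cached pid, roughly like this (wd_plug_server appears in the logs above; get_status and the function names are illustrative):

    %% Somewhere in the startup code: register the server under a
    %% cluster-wide name instead of only returning a raw pid.
    start_plug_server() ->
        gen_server:start_link({global, wd_plug_server},
                              wd_plug_server, [], []).

    %% Callers always look the server up by name, so they keep working
    %% after the process has been restarted on another node.
    plug_server_status() ->
        gen_server:call({global, wd_plug_server}, get_status).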



Tags: erlang otp