Wrong process getting killed on other node?

Published 2019-08-30 19:37

Question:

I wrote a simple program ("controller") to run some computation on a separate node ("worker"). The reason being that if the worker node runs out of memory, the controller still works:

-module(controller).
-compile(export_all).

p(Msg,Args) -> io:format("~p " ++ Msg, [time() | Args]).

progress_monitor(P,N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat,self(),60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller,progress_monitor,[P,Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.

To test it, I deliberately wrote 3 factorial implementations that will use up lots of memory and crash, and a fourth implementation which uses tail recursion to avoid taking too much space:

-module(worker).
-compile(export_all).

start(1,P,N) -> P ! factorial1(N);
start(2,P,N) -> P ! factorial2(N);
start(3,P,N) -> P ! factorial3(N);
start(4,P,N) -> P ! factorial4(N,1).

factorial1(0) -> 1;
factorial1(N) -> N*factorial1(N-1).

factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N*factorial2(N-1)
    end.

factorial3(N) -> lists:foldl(fun(X,Y) -> X*Y end, 1, lists:seq(1,N)).

factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).

Note that even with the tail-recursive version, I'm calling it with 60000000000, which will probably take days to finish on my machine. Here is the output of running the controller:

$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok

It almost works, but worker #4 was killed too early (it should have been killed close to 23:31:45, not 23:29:29). Looking deeper, a kill was attempted only against worker #1, and against no other. So worker #4 should not have died, yet it did. Why? We can even see that the reason was took_to_long, and that progress_monitor #1 started at 23:24:28, five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?

Here is the output of the worker when I ran the controller:

$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 

Answer 1:

There are several issues here, and ultimately what you experienced is creation-number wraparound.

Since you never cancel the progress_monitor process, it always sends an exit signal after 5 minutes, whether the worker has finished or not.

The computation is long and/or the VM is slow, so worker 4 is still running 5 minutes after the progress monitor for worker 1 was started.

The 4 worker nodes were started sequentially with the same name, worker@localhost, and the creation numbers of the first and the fourth node are the same.

Creation numbers (the creation field in references and pids) are a mechanism to prevent pids and references created by a crashed node from being interpreted by a new node with the same name. This is exactly what you rely on in your code: when you try to kill worker 1 after its node is long gone, you do not intend to kill a process in a restarted node.

When a node sends a pid or a reference, it encodes its creation number in it. When it receives a pid or a reference from another node, it checks that the creation number in the pid matches its own creation number. Creation numbers are assigned by epmd, cycling through the sequence 1, 2, 3.

Here, unfortunately, when the 4th node gets the exit signal, the creation number matches because the sequence has wrapped around. And since every fresh node does exactly the same work before spawning the worker (booting the Erlang VM and its initial processes), the worker on node 4 ends up with the very same pid as the worker on node 1.

As a result, the controller eventually kills worker 4 believing it is worker 1.

To avoid this, you need something more robust than the creation number whenever 4 or more incarnations of the worker node can come and go within the lifespan of a pid or a reference held by the controller; one possibility is sketched below.