Wrong process getting killed on other node?

Posted 2019-11-02 09:18

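I wrote a simple program ("controller") that runs a computation on a separate node ("worker"). The idea is that if the worker node runs out of memory, the controller still works: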

-module(controller).
-compile(export_all).

p(Msg,Args) -> io:format("~p " ++ Msg, [time() | Args]).

progress_monitor(P,N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat,self(),60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller,progress_monitor,[P,Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.

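To test it, I deliberately wrote three factorial implementations that use up a lot of memory and crash, plus a fourth that uses tail recursion to avoid taking too much space: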

-module(worker).
-compile(export_all).

start(1,P,N) -> P ! factorial1(N);
start(2,P,N) -> P ! factorial2(N);
start(3,P,N) -> P ! factorial3(N);
start(4,P,N) -> P ! factorial4(N,1).

factorial1(0) -> 1;
factorial1(N) -> N*factorial1(N-1).

factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N*factorial2(N-1)
    end.

factorial3(N) -> lists:foldl(fun(X,Y) -> X*Y end, 1, lists:seq(1,N)).

factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).

Note that I call it with 60 billion, so even the tail-recursive factorial4 would probably take days on my machine. Here is the output of running the controller:

$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok

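It almost works, but worker #4 was killed too early (it should have been close to 23:31:45, not 23:29:29). Looking closer, there was only ever an attempt to kill worker #1, and none of the others. So worker #4 should not have died, yet it did. Why? We can even see that the reason was took_to_long, and that progress_monitor #1 started at 23:24:28, five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?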

Here is the output of the worker while I ran the controller:

$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 

Answer 1:

There are several problems here, and in the end you ran into creation number wrap-around.

Since you never cancel the progress_monitor process, it will always send an exit signal after 5 minutes.
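One way to avoid a stale timeout is to fold the five-minute limit into the receive itself, so that no separate progress_monitor process outlives the worker it was meant for. The following is a minimal sketch under that assumption, not the original poster's code; the module name controller_timeout and the function await/2 are hypothetical.

```erlang
-module(controller_timeout).
-export([await/2]).

%% Wait at most TimeoutMs milliseconds for either a result message or
%% the death of Pid. The timeout lives inside this receive, so it
%% disappears as soon as the worker answers or dies.
await(Pid, TimeoutMs) ->
    Ref = monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {died, Reason};
        Result ->
            %% Like the catch-all clause in the question, this accepts
            %% any other message as the result.
            demonitor(Ref, [flush]),
            {ok, Result}
    after TimeoutMs ->
        %% The worker is still the one we are monitoring, so the pid
        %% cannot be stale at this point.
        demonitor(Ref, [flush]),
        exit(Pid, took_to_long),
        timeout
    end.
```

With this shape, start/1 in the question could call something like controller_timeout:await(P, 5*60*1000) right after spawning the worker, and the spawn of progress_monitor would no longer be needed.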

The computation is long and/or the VM is slow, so process 4 is still running 5 minutes after the progress monitor for process 1 was started.

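The 4 worker nodes are started one after another with the same name, worker@localhost, and the creation numbers of the first and the fourth node are the same.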

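Creation numbers (the creation field in references and pids) are a mechanism that prevents pids and references created by a crashed node from being interpreted by a new node with the same name. That is exactly the situation in your code: when you try to kill worker 1 long after its node is gone, you do not intend to kill a process on a restarted node.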

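When a node sends a pid or a reference, it encodes its creation number in it. When it receives a pid or a reference from another node, it checks that the creation number in the pid matches its own creation number. The creation number is assigned by epmd, following the sequence 1, 2, 3.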

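Here, unfortunately, when the 4th node receives the exit signal, the creation number matches because this sequence has wrapped around. And since each node spawned its processes after doing exactly the same thing as the ones before it (initializing Erlang), the pid of the worker on node 4 matches the pid of the worker on node 1.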
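The wrap-around can be illustrated with a tiny model of the 1, 2, 3 sequence described above. This is only a sketch of the arithmetic, not epmd's actual code; the module name creation_model and both functions are hypothetical.

```erlang
-module(creation_model).
-export([next/1, nth/1]).

%% Hypothetical model: the creation number cycles 1 -> 2 -> 3 -> 1.
next(C) when C >= 1, C =< 3 ->
    (C rem 3) + 1.

%% Creation number of the Nth node started under one name (N >= 1),
%% assuming the first start gets 1.
nth(N) when is_integer(N), N >= 1 ->
    ((N - 1) rem 3) + 1.
```

Under this model nth(1) and nth(4) are both 1, which is why a pid created on the first worker node is accepted by the fourth.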

As a result, the controller ends up killing worker 4, believing it is worker 1.

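To avoid this, you need something more robust than the creation number if there can be 4 workers within the lifetime of a pid or reference held by the controller.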
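One possible shape for "something more robust" is to stop relying on the pid alone and attach a token that is unique to each run. The sketch below is an assumption-laden illustration, not part of the original answer: the module token_guard and its functions are hypothetical, and the module would have to be loadable on the worker node. A guard process on the worker node runs the job and only honours stop requests that carry the token generated for that run, so a late request aimed at an earlier worker is ignored even if the pids collide.

```erlang
-module(token_guard).
-export([start/4, stop/2, guard/4]).

%% Start M:F(A) on Node under a guard process. The token is created on
%% the calling (controller) node and identifies this particular run.
start(Node, M, F, A) ->
    Token = make_ref(),
    Guard = spawn(Node, ?MODULE, guard, [Token, M, F, A]),
    {Guard, Token}.

%% Ask a guard to stop its job. A guard holding a different token
%% ignores the request.
stop(Guard, Token) ->
    Guard ! {stop, Token},
    ok.

%% Runs on the worker node: start the job and wait for events.
guard(Token, M, F, A) ->
    process_flag(trap_exit, true),
    Job = spawn_link(M, F, A),
    guard_loop(Token, Job).

guard_loop(Token, Job) ->
    receive
        {stop, Token} ->              % Token is bound: only this run matches
            exit(Job, kill);
        {'EXIT', Job, _Reason} ->     % the job finished or crashed
            ok;
        _Other ->                     % stale or foreign request: ignore it
            guard_loop(Token, Job)
    end.
```

A controller would then keep the {Guard, Token} pair returned by token_guard:start/4 and call token_guard:stop(Guard, Token) on timeout instead of exit(P, took_to_long).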



Source: Wrong process getting killed on other node?