I wrote a simple program ("controller") that runs a computation on a separate node ("worker"). The idea is that if the worker node runs out of memory, the controller keeps running:
-module(controller).
-compile(export_all).

%% Print a message prefixed with the current wall-clock time.
p(Msg, Args) -> io:format("~p " ++ Msg, [time() | Args]).

%% Kill process P after five minutes; N is the strategy number, used for logging.
progress_monitor(P, N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).

%% Run strategy Strat on the worker node, then move on to the next strategy (up to 4).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat, self(), 60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller, progress_monitor, [P, Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.
To test it, I deliberately wrote three factorial implementations that use up lots of memory and crash, and a fourth that uses tail recursion to avoid taking too much space:
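-module(worker).
-compile(export_all).

%% Compute N! with the chosen strategy and send the result to P.
start(1, P, N) -> P ! factorial1(N);
start(2, P, N) -> P ! factorial2(N);
start(3, P, N) -> P ! factorial3(N);
start(4, P, N) -> P ! factorial4(N, 1).

%% Body-recursive, clause-based.
factorial1(0) -> 1;
factorial1(N) -> N * factorial1(N-1).

%% Body-recursive, using a case expression.
factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N * factorial2(N-1)
    end.

%% Builds the whole list 1..N first, then folds over it.
factorial3(N) -> lists:foldl(fun(X, Y) -> X*Y end, 1, lists:seq(1, N)).

%% Tail-recursive, with accumulator A.
factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).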
Note that I'm calling it with 60000000000, which will probably take days on my machine even with the tail-recursive factorial4. Here is the output of running the controller:
$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok
It almost works, but worker #4 was killed too early (it should have been killed close to 23:31:45, not at 23:29:29). Looking closer, the only kill attempt that was logged came from the progress_monitor for worker #1; none of the others logged anything. So worker #4 should not have died yet, but it did. Why? The exit reason was took_to_long, and progress_monitor #1 started at 23:24:28, about five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?
Here is the output of the worker when I ran the controller:
$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
(worker@localhost)1>
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
(worker@localhost)1>
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
(worker@localhost)1>
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
(worker@localhost)1>
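One way I could check the suspicion that progress_monitor #1 was holding a pid equal to worker #4's would be to include the pid in every log line. The sketch below is only an illustration: the module `pid_log` and its functions are hypothetical, and `demo/0` spawns a local process rather than one on worker@localhost.

```erlang
-module(pid_log).
-export([describe/1, demo/0]).

%% Hypothetical helper: render a pid together with the node it lives on,
%% so that log lines for different workers can be compared by eye.
describe(P) when is_pid(P) ->
    lists:flatten(io_lib:format("~p on ~p", [P, node(P)])).

%% Spawn a throwaway local process and log it the same way the
%% controller's p/2 does (time first, then the message).
demo() ->
    P = spawn(fun() -> receive stop -> ok end end),
    io:format("~p worker pid: ~s~n", [time(), describe(P)]),
    P ! stop,
    ok.
```

In the controller, the same idea would amount to passing `describe(P)` (or just `P`) as an extra argument to the `p/2` calls in `start/1` and `progress_monitor/2`. If the pid printed for worker #1 and the pid printed for worker #4 turn out to be identical, that would support the theory that the stale progress_monitor hit the new process.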