When I run a simple function in parallel (with different inputs per worker), the parallelized version of the code consistently takes longer than the sequential one would.
The simple function is:
addprocs(2)
@everywhere function simple(i,Q,vt_0,ct_0)
a=sum((rand(1:i),vt_0,ct_0))*2
return a
end
The parallelized version of my code is:
#In Parallel
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;
for i=1:N
v[i]=fetch(remotecall(simple,2,i,1,vt_0,ct_0))
c[i]=fetch(remotecall(simple,3,i,2,vt_0,ct_0))
vt_0=v[i]
ct_0=c[i]
end
toc()
The sequential code would just call the function in each iteration of the loop (no remotecall() nor fetch()). I am thinking that the way in which I call the workers in parallel in Julia with remotecall() is the root of the problem, but I am not sure.
Can anybody help me figure out why this parallel code is so slow compared to just calling the function? Or is this simple calculation not worth parallelizing?
EDIT:
Here is the Sequential code:
#Sequential
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;
for i=1:N
v[i]=simple(i,1,vt_0,ct_0)
c[i]=simple(i,2,vt_0,ct_0)
vt_0=v[i]
ct_0=c[i]
end
toc()
Even for the case of the very-"shallow"-computing inside simple(), the Julia documentation recommends to rather use remotecall_fetch() instead of fetch( remotecall( ... ) ).
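A minimal sketch of that substitution, keeping the question's simple() and two workers. This is an assumption-laden rewrite, not the asker's code: the using Distributed line and the helper function run_parallel!() are only needed on Julia 1.x (addprocs() is built in on the 0.x used in the question), and N is shrunk just to keep the demo quick:

```julia
using Distributed            # needed on Julia 1.x; built in on the 0.x shown above

addprocs(2)                  # workers 2 and 3, matching the question

@everywhere function simple(i, Q, vt_0, ct_0)
    # Q is passed but unused, exactly as in the question's version
    return sum((rand(1:i), vt_0, ct_0)) * 2
end

# One message round-trip per call, instead of a remotecall() round
# plus a second round for fetch() on the returned remote reference.
function run_parallel!(v, c, N)
    vt_0 = 0.0
    ct_0 = 0.0
    for i = 1:N
        v[i] = remotecall_fetch(simple, 2, i, 1, vt_0, ct_0)
        c[i] = remotecall_fetch(simple, 3, i, 2, vt_0, ct_0)
        vt_0 = v[i]
        ct_0 = c[i]
    end
    return v, c
end

N = 1_000                    # smaller than the question's 10_000, just for a quick demo
v = zeros(N)
c = zeros(N)
run_parallel!(v, c, N)
```

Note that this only removes one of the two network round-trips per call; the per-call messaging overhead itself, and the loop-carried dependency discussed below, remain.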
The [PARALLEL]-process @spawn / fetch() overheads

This is a very often overlooked topic, so it is great to actively recognise it and to try to reason about it. Actually, this is a common root-cause of adverse impact on the final performance of [PARALLEL]-process scheduling. If indeed interested in details, on both the mathematical model and an interactive UI-tool to simulate / evaluate the net effects of the overhead-strict speedup formula on [PARALLEL] code-executions for as many as 2 .. 8000+ CPU-cores, feel free to read more on this here.

The main "suspects" here are the value inter-process dependencies:
One might be hidden inside the system function rand(). In case a cryptographically strong implementation is used, each, indeed EACH, call to rand() must update the central source-of-randomness' state. That means that, for this particular reason, all, indeed ALL, the spawned processes have to have established and also maintained a queue to this central shared service, which will not be present at all ( as it can make zero source-of-randomness'-state-update troubles ) in a simple [SEQ]-code-execution, but will require some additional hidden overheads ( MUTEX / LOCKS / value-update atomics et al ) in case of [PAR]-code-execution units, which actually "share" this central resource ( it being hidden to be strictly transactional, with a concurrency-capacity of 1, operating a 1:N-blocking-logic signalling, with no rollback ... ) and must execute and enforce the atomically safe update of the source-of-randomness' state before any next call to the source-of-randomness' service may get served.

The other is the {prior|next}-loop-step dependency, the more so once mutually crossed among the pair of calls to the simple()-process instances. As illustrated below, this mutual inter-dependence actually makes all potentially spawned calls have to be arranged as a pure [SEQ]-process-schedule, and the "remaining" sum() is indeed not much meat to chew in a [PARALLEL]-process schedule.

[ Illustration omitted: the actual mutual inter-process dependency, depicting the impossibility to at least inject the loop so as to make [PARALLEL]-processing more efficient ]

While adding CSP-channels with explicit inter-process communication via
{ put!() | take!() }
Channel-methods may indeed get this dependency communicated, yet, guess what, the coroutines-scheduling will only add additional overheads, so expect to pay even more to get less.

A minor note on raw-profiling:
In all cases, it is advisable to place the tic() .. toc()-bracket tight around the code-section under test, and to avoid and exclude any and all memory-allocations and similar extremely long and noisy parts from the actual measured execution:
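A sketch of that placement, using the sequential loop from the question. Two assumptions here: tic()/toc() existed up to Julia 0.6, so the @elapsed macro plays their role below on current versions, and the warm-up call is an added precaution to keep JIT-compilation noise out of the bracket:

```julia
# Sequential version of the question's loop, with every allocation and the
# JIT warm-up kept *outside* of the timed bracket.
function simple(i, Q, vt_0, ct_0)
    # Q is passed but unused, as in the question's version
    return sum((rand(1:i), vt_0, ct_0)) * 2
end

function timed_loop!(v, c, N)
    vt_0 = 0.0
    ct_0 = 0.0
    for i = 1:N
        v[i] = simple(i, 1, vt_0, ct_0)
        c[i] = simple(i, 2, vt_0, ct_0)
        vt_0 = v[i]
        ct_0 = c[i]
    end
    return v, c
end

N = 10_000
v = zeros(N)                       # allocations: excluded from the measurement
c = zeros(N)

timed_loop!(v, c, N)               # warm-up call: excludes compilation noise
t = @elapsed timed_loop!(v, c, N)  # the bracket sits tight around the loop alone
println("sequential loop took $t s")
```

Measured this way, the bracket reports only the work under test, which makes the sequential-versus-parallel comparison meaningful.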