Julia: Parallel code slower than Sequential code

Posted 2019-04-11 02:06

When I run a simple function in parallel (with different inputs per worker), the parallelized version consistently takes longer than the sequential one.

The simple function is

addprocs(2)
@everywhere function simple(i,Q,vt_0,ct_0)
  a=sum((rand(1:i),vt_0,ct_0))*2
  return a
end

The parallelized version of my code is

#In Parallel
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;

for i=1:N

  v[i]=fetch(remotecall(simple,2,i,1,vt_0,ct_0))

  c[i]=fetch(remotecall(simple,3,i,2,vt_0,ct_0))

  vt_0=v[i]
  ct_0=c[i]

end
toc()

The sequential version just calls the function in each iteration of the loop (no remotecall() nor fetch()). I suspect that the way I call the workers in parallel in Julia with remotecall() is the root of the problem, but I am not sure.

Can anybody help me figure out why this parallel code is so slow compared to just calling the function? Or is this simple calculation not worth parallelizing?

EDIT:
Here is the sequential code:

#Sequential
tic()
N=10000;
v=zeros(N);
c=zeros(N);
vt_0=0;
ct_0=0;

for i=1:N

  v[i]=simple(i,1,vt_0,ct_0)

  c[i]=simple(i,2,vt_0,ct_0)

  vt_0=v[i]
  ct_0=c[i]
end
toc()

1 Answer

Answered 2019-04-11 02:34

Even for the case of the very "shallow" computing inside simple(), the Julia documentation recommends using remotecall_fetch() rather than fetch( remotecall( ... ) ):

The function remotecall_fetch() exists for this purpose. It is equivalent to fetch(remotecall(...)) but is more efficient.
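As a minimal sketch (assuming the very same two workers and the same simple() as above), the measured loop would then read:

for i = 1:N
    # one message round-trip per call, instead of a schedule + a fetch pair:
    v[i] = remotecall_fetch( simple, 2, i, 1, vt_0, ct_0 )
    c[i] = remotecall_fetch( simple, 3, i, 2, vt_0, ct_0 )
    vt_0 = v[i]
    ct_0 = c[i]
end

This saves one round-trip of messaging per call, yet it does not and cannot remove the per-call scheduling overheads themselves.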


The [PARALLEL]-process @spawn / fetch() overheads

This is a very often overlooked topic, and it is great to actively recognise it and try to reason about it, as it is a common root cause of adverse impacts on the final performance of [PARALLEL]-process scheduling. If indeed interested in the details, getting both the mathematical model and an interactive UI-tool to simulate / evaluate the net effects of the overhead-strict speedup formula on [PARALLEL] code-executions for as much as 2 .. 8000+ CPU-cores, feel free to read more on this here.
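For reference, the overhead-strict speedup can be sketched directly in Julia (a sketch only; p, N, o_setup and o_join are assumed symbols, not taken from the question: the parallelisable fraction, the worker count, and the per-run setup / join overheads expressed as fractions of the single-process runtime):

# overhead-strict re-formulation of Amdahl's law ( a sketch ):
# once ( o_setup + o_join ) outweighs the saved ( p - p/N ), S < 1,
# i.e. the "parallel" run is slower than the pure-[SEQ] one
S( p, N, o_setup, o_join ) = 1 / ( ( 1 - p ) + o_setup + p / N + o_join )

which is exactly the effect observed in the question.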


The main "suspects" here are the inter-process value-dependencies:

  • One might be hidden inside the system function rand(). If a cryptographically strong implementation is used, each and every call to rand() must update the central source-of-randomness state. For this particular reason, all the spawned processes have to establish and maintain a queue to this central shared service. Such a queue is absent in a plain [SEQ]-code execution (a single process can never trouble the source-of-randomness state updates), but it introduces additional hidden overheads ( MUTEX / LOCKS / value-update atomics et al. ) in a [PAR]-code execution, because the units actually "share" this central resource ( hidden to be strictly transactional, with a concurrency-capacity of 1, operating a 1:N-blocking-logic signalling, with no rollback ) and must enforce an atomically safe update of the source-of-randomness state before any next call to the source-of-randomness service may get served.

  • The other is the {prior|next}-loop-step dependency, here even mutually crossed between the pair of calls to the simple()-process instances. As illustrated below, this mutual inter-dependence forces all potentially spawned calls into a pure [SEQ]-process schedule, and the "remaining" sum() is indeed not much meat to chew in a [PARALLEL]-process schedule.


ACTUAL MUTUAL INTER-PROCESS DEPENDENCY
THE ILLUSTRATION DEPICTS WHY IT IS IMPOSSIBLE TO AT LEAST INJECT THE LOOP
SO AS TO MAKE THE [PARALLEL]-PROCESSING MORE EFFICIENT:

#                            +-------------<---- a way to "inject" a for-loop
#                            |  +----------<---- NOT CONSUMED AT ALL
#                            |  |
#                            |  |             +--******* BUT ********* DEPENDENCY
#                            |  |             |
#                            |  |             v
#                            |  |  +-------<-->- v[]-{prior|next}-step DEPENDENCY
#                            |  |  |     +-<-->- c[]-{prior|next}-step DEPENDENCY
#                            |  |  |     |
#           function simple( i, Q, vt_0, ct_0 )
@everywhere function simple( N, Q, vt_0, ct_0, aVEC )
   for     i  = 1:N
      aVEC[i] = sum( (  rand(1:i), vt_0, ct_0 ) )*2
      #    a  = sum( (  rand(1:i), vt_0, ct_0 ) )*2
      #    return a
   end
end
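( Note also, as a side remark: for the results written into aVEC to be visible back on the calling process, aVEC would have to be e.g. a SharedArray, because an ordinary Array passed via remotecall() gets serialised, so the remote writes remain local to the worker. )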

While adding CSP-channels with explicit inter-process communication via the { put!() | take!() } Channel-methods may solve the need to communicate this dependency, guess what, the coroutine-scheduling will only add additional overheads, so expect to pay even more to get less.
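A minimal sketch of such wiring (with hypothetical channel names vt_chan / ct_chan, shown on a single process just to expose the added hand-off machinery):

# the {prior|next}-step state has to travel through a channel, so each
# step still blocks on the previous one -- the schedule remains a
# pure-[SEQ] chain, now with extra put!() / take!() overheads on top
vt_chan = Channel{Float64}( 1 )
ct_chan = Channel{Float64}( 1 )
put!( vt_chan, 0.0 )                  # seed the initial state
put!( ct_chan, 0.0 )

for i = 1:N
    vt_0 = take!( vt_chan )           # blocks until the prior step finished
    ct_0 = take!( ct_chan )
    v[i] = simple( i, 1, vt_0, ct_0 )
    c[i] = simple( i, 2, vt_0, ct_0 )
    put!( vt_chan, v[i] )             # hand the state over to the next step
    put!( ct_chan, c[i] )
end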


A minor note on raw-profiling:

In all cases, it is advisable to place the tic() .. toc()-bracket tight around the code-section under test, and to avoid and exclude any and all memory-allocations and similarly long and noisy parts from the actually measured execution:

# -------------------------------------------------------------<SuT>-START
tic()
for i=1:N

  v[i]=fetch(remotecall(simple,2,i,1,vt_0,ct_0))
  c[i]=fetch(remotecall(simple,3,i,2,vt_0,ct_0))

  vt_0=v[i]
  ct_0=c[i]

end
toc()
# -------------------------------------------------------------<SuT>-FINISH
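On Julia >= 1.0, where tic() / toc() no longer exist, the same tight bracket can be expressed with the @elapsed macro; a minimal sketch, where run_loop!() is a hypothetical wrapper introduced here only to keep the hot loop out of the global scope:

# the tic()/toc()-equivalent measurement bracket on Julia >= 1.0
using Distributed                     # remotecall_fetch() now lives here

function run_loop!( v, c, N )         # hypothetical wrapper, not in the question
    vt_0 = 0.0
    ct_0 = 0.0
    for i = 1:N
        v[i] = remotecall_fetch( simple, 2, i, 1, vt_0, ct_0 )
        c[i] = remotecall_fetch( simple, 3, i, 2, vt_0, ct_0 )
        vt_0 = v[i]
        ct_0 = c[i]
    end
end

t = @elapsed run_loop!( v, c, N )     # <SuT>-bracket, measured tight
println( "elapsed: ", t, " [s]" )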