So, I am playing around with multiprocessing.Pool and Numpy, but it seems I have missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes being created, but they all share one of the CPUs, adding up to ~100%.
$ cat test_multi.py
import numpy as np
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    pool = Pool(8)
    print timeit(lambda: map(mmul, matrices), number=20)
    print timeit(lambda: pool.map(mmul, matrices), number=20)

$ python test_multi.py
16.0265390873
19.097837925
[update]
- changed to timeit for benchmarking processes
- init Pool with the number of my cores
- changed the computation so that there is more computation and less memory transfer (I hope)

Still no change. The pool version is still slower, and I can see in htop that only one core is used, although several processes are spawned.
[update2]
At the moment I am reading about @Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know:
- Why does my example work for tiago? What could be the reason it is not working on my machine¹?
- Is there any copying between the processes in my example code? I intended my code to give each process one matrix of the matrices list.
- Is my code a bad example because I use Numpy?

I learned that one often gets better answers when others know the end goal, so: I have a lot of files which are at the moment loaded and processed in a serial fashion. The processing is CPU-intense, so I assume much could be gained by parallelization. My aim is to call the Python function that analyses a file in parallel. Furthermore, this function is just an interface to C code; I assume that makes a difference.

¹ Ubuntu 12.04, Python 2.7.3, i7 860 @ 2.80 GHz - please leave a comment if you need more info.
[update3]
Here are the results from Stefano's example code. For some reason there is no speed up. :/
testing with 16 matrices
base 4.27
1 5.07
2 4.76
4 4.71
8 4.78
16 4.79
testing with 32 matrices
base 8.82
1 10.39
2 10.58
4 10.73
8 9.46
16 9.54
testing with 64 matrices
base 17.38
1 19.34
2 19.62
4 19.59
8 19.39
16 19.34
[update 4] Answer to Jan-Philip Gehrcke's comment
Sorry that I haven't made myself clearer. As I wrote in update 2, my main goal is to parallelize many serial calls of a third-party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler: the example shown above with numpy. But there, too, I could not achieve a performance improvement, even though it looks 'embarrassingly parallelizable' to me. So I assume I must have missed something important. This information is what I am looking for with this question and bounty.
[update 5]
Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read about the basics and create new SO questions when I have a clearer understanding of what I don't know.
Since you mention that you have a lot of files, I would suggest the following solution: write a function that loads and processes a single file, then use Pool.map() to apply that function to the list of filenames (see the sketch below). Since every worker then loads its own file, the only data passed around are filenames, not (potentially large) numpy arrays.

By default, Pool only uses n processes, where n is the number of CPUs on your machine. If you want it to use a different number, you need to specify it, e.g. Pool(5). See here for more info.
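A minimal sketch of this pattern, assuming the input file locations and the process_file function are placeholders for your actual loading and analysis code:

import glob
from multiprocessing import Pool

def process_file(filename):
    # hypothetical placeholder: load one file and run the CPU-intense
    # analysis on it, returning only a small result object
    data = open(filename, 'rb').read()
    return filename, len(data)

if __name__ == '__main__':
    filenames = glob.glob('data/*.dat')          # assumed location of the input files
    pool = Pool(4)                               # explicit number of worker processes
    results = pool.map(process_file, filenames)  # only the filenames get pickled
    pool.close()
    pool.join()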
I also noticed that when I ran numpy matrix multiplication inside of a Pool.map() function, it ran much slower on certain machines. My goal was to parallelize my work using Pool.map(), and run a process on each core of my machine. When things were running fast, the numpy matrix multiplication was only a small part of the overall work performed in parallel. When I looked at the CPU usage of the processes, I could see that each process could use e.g. 400+% CPU on the machines where it ran slow, but always <=100% on the machines where it ran fast. For me, the solution was to stop numpy from multithreading. It turns out that numpy was set up to multithread on exactly the machines where my Pool.map() was running slow. Evidently, if you are already parallelizing using Pool.map(), then having numpy also parallelize just creates interference. I just called
export MKL_NUM_THREADS=1
before running my Python code and it worked fast everywhere.

Regarding the fact that all of your processes are running on the same CPU, see my answer here.
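If you would rather do this from inside the script than in the shell, a minimal sketch (assuming an MKL-backed numpy; OMP_NUM_THREADS or OPENBLAS_NUM_THREADS are the analogous knobs for other BLAS backends):

import os
# must be set before numpy is imported, otherwise the BLAS thread pool
# may already have been initialised with the default thread count
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np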
During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool, all of the worker processes that it spawns will end up vying for the same core rather than using all of the cores available on your machine. You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:
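The original snippet is not reproduced here; a minimal sketch of the idea, assuming a Linux machine where the taskset utility is available (the 0xff mask allows the first 8 cores; adjust it to your CPU count):

import os
import numpy as np
from multiprocessing import Pool

# reset the affinity of this process (and of any children forked later)
# to the first 8 cores; numpy's import may have restricted it to one core
os.system("taskset -p 0xff %d" % os.getpid())

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = [np.random.randn(1000, 1000) for _ in range(8)]
    pool = Pool(8)
    results = pool.map(mmul, matrices)  # should now spread across all cores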
If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation etc. probably outweighs any possible benefit from parallelisation.

Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.

For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:
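The original listing is not shown here; a minimal reconstruction of that kind of comparison (non-vectorized element-wise multiplication, timed serially and with a Pool; sizes and repeat counts are arbitrary, and the actual timings will depend on your machine):

from timeit import timeit
from multiprocessing import Pool
import random

def slow_mul(pair):
    # deliberately non-vectorized element-wise multiplication, repeated a few
    # times so the work is clearly CPU-bound relative to the transfer cost
    a, b = pair
    for _ in range(20):
        result = [x * y for x, y in zip(a, b)]
    return result

if __name__ == '__main__':
    pairs = [([random.random() for _ in range(10000)],
              [random.random() for _ in range(10000)]) for _ in range(50)]
    pool = Pool(4)
    print timeit(lambda: map(slow_mul, pairs), number=5)
    print timeit(lambda: pool.map(slow_mul, pairs), number=5)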
The unpredictable competition between communication overhead and computation speedup is definitely the issue here. What you are observing is perfectly fine. Whether you get a net speed-up depends on many factors and is something that has to be quantified properly (as you did).
So why is multiprocessing so "unexpectedly slow" in your case? multiprocessing's map and map_async functions actually pickle Python objects back and forth through pipes that connect the parent with the child processes. This may take a considerable amount of time. During that time, the child processes have almost nothing to do, which is what you see in htop. Between different systems, there might be a considerable difference in pipe transport performance, which is also why for some people your pool code is faster than your single-CPU code, although for you it is not (other factors might come into play here; this is just an example in order to explain the effect).
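To get a feeling for this cost, you can time the serialization step in isolation. A rough sketch (not exactly what multiprocessing does internally, but the same kind of work):

import pickle
import numpy as np
from timeit import timeit

m = np.random.rand(1000, 1000)  # roughly the matrix size from the question

# time just the pickling that Pool.map would have to do for every task,
# once per direction (parent -> child and child -> parent)
print timeit(lambda: pickle.dumps(m, protocol=2), number=20)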
What can you do to make it faster?

1. Don't pickle the input on POSIX-compliant systems.

If you are on Unix, you can get around the parent->child communication overhead by taking advantage of POSIX's fork behavior (copy-on-write memory): create your job input (e.g. a list of large matrices) in the parent process in a globally accessible variable. Then create worker processes by calling multiprocessing.Process() yourself. In the children, grab the job input from the global variable. Simply expressed, this makes the child access the memory of the parent without any communication overhead (*, explanation below). Send the result back to the parent through e.g. a multiprocessing.Queue. This will save a lot of communication overhead, especially if the output is small compared to the input. This method won't work on e.g. Windows, because multiprocessing.Process() there creates an entirely new Python process that does not inherit the state of the parent.

2. Make use of numpy multithreading.

Depending on your actual calculation task, it might happen that involving multiprocessing won't help at all. If you compile numpy yourself and enable OpenMP directives, then operations on large matrices might become very efficiently multithreaded by themselves (and distributed over many CPU cores; the GIL is no limiting factor here). Basically, this is the most efficient usage of multiple CPU cores you can get in the context of numpy/scipy.

* The child cannot directly access the parent's memory in general. However, after fork(), parent and child are in an equivalent state. It would be stupid to copy the entire memory of the parent to another place in RAM, which is why the copy-on-write principle kicks in: as long as the child does not change its memory state, it actually accesses the parent's memory. Only upon modification are the corresponding bits and pieces copied into the memory space of the child.

Major edit:
Let me add a piece of code that crunches a large amount of input data with multiple worker processes and follows the advice "1. Don't pickle the input on POSIX-compliant systems." Furthermore, the amount of information transferred back to the worker manager (the parent process) is quite low. The heavy computation part of this example is a singular value decomposition, which can make heavy use of OpenMP. I have executed the example multiple times:

- With OMP_NUM_THREADS=1, so each worker process creates a maximum load of 100 %. There, the mentioned number-of-workers vs. compute-time scaling behavior is almost linear and the net speedup factor corresponds to the number of workers involved.
- With OMP_NUM_THREADS=4, so that each process creates a maximum load of 400 % (via spawning 4 OpenMP threads). My machine has 16 real cores, so 4 processes with a maximum load of 400 % each will almost get the maximum performance out of the machine. The scaling is not perfectly linear anymore and the speedup factor is not the number of workers involved, but the absolute calculation time becomes significantly reduced compared to OMP_NUM_THREADS=1, and time still decreases significantly with the number of worker processes.
- With OMP_NUM_THREADS=4, resulting in an average system load of 1253 %.
- With OMP_NUM_THREADS=5, resulting in an average system load of 1598 %, which suggests that we got everything from that 16-core machine. However, the actual computation wall time does not improve compared to the latter case.
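The original code listing and its output are not reproduced here. Below is a minimal sketch of the pattern described (job input in a global variable inherited via fork, multiprocessing.Process workers, a multiprocessing.Queue for the small results, SVD as the heavy part); all names and sizes are illustrative, and it assumes a POSIX system:

import numpy as np
import multiprocessing

# Job input lives in a module-level (global) variable. On POSIX systems the
# worker processes created below inherit it via fork/copy-on-write, so the
# matrices are never pickled and pushed through a pipe.
MATRICES = [np.random.rand(500, 500) for _ in range(8)]

def worker(indices, result_queue):
    # Each worker grabs its share of the job input from the global variable
    # and sends back only one small number per matrix.
    for i in indices:
        s = np.linalg.svd(MATRICES[i], compute_uv=False)  # the heavy part
        result_queue.put((i, s.max()))

if __name__ == '__main__':
    n_workers = 4
    queue = multiprocessing.Queue()
    # split the indices of the global job list among the workers
    chunks = [range(len(MATRICES))[w::n_workers] for w in range(n_workers)]
    procs = [multiprocessing.Process(target=worker, args=(c, queue))
             for c in chunks]
    for p in procs:
        p.start()
    # collect all results before joining, to avoid blocking on a full queue
    results = [queue.get() for _ in range(len(MATRICES))]
    for p in procs:
        p.join()
    print sorted(results)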
Your code is correct. I just ran it on my system (with 2 cores, hyperthreading). I looked at the processes and, as expected, the parallel part showed several processes working at near 100%. This must be something in your system or Python installation.