I wrote a very simple program that repeatedly sums arrays, in both Fortran and Python. When I launch several independent jobs from a shell script, there is a dramatic slow-down as soon as more than one job runs at the same time.
The Fortran version of my code is as follows:
program main
  implicit none
  real*8 begin, end, Ht(2, 2), ls(4)
  integer i, j, k, ii, jj, kk
  integer,parameter::N_tiles = 20
  integer,parameter::N_tilings = 100
  integer,parameter::max_t_steps = 50
  real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
  real*8 rand_val
  call random_seed()
  do i = 1, N_tiles*N_tilings
    do j = 1, max_t_steps
      do k = 1, 5
        call random_number(rand_val)
        test_e(i, j, k) = rand_val
        call random_number(rand_val)
        test_theta(i, j, k) = rand_val
      end do
    end do
  end do
  call CPU_TIME(begin)
  do i = 1, 1001
    do j = 1, 50
      test_theta = test_theta+0.5d0*test_e
    end do
  end do
  call CPU_TIME(end)
  write(*, *) 'total time cost is : ', end-begin
end program main
and the shell script is as follows:
gfortran -o result test.f90
nohup ./result &
nohup ./result &
nohup ./result &
As we can see, the main operation is the summation of the arrays test_theta and test_e. These arrays are not large (about 4 MB each), and my computer has enough memory for this job. My workstation has 6 cores with 12 threads. I submitted 1, 2, 3, 4 and 5 jobs at a time from the shell, and the time cost is as follows:
| Number of jobs | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Time (s) | 21 | 31 | 161 | 237 | 357 |
I expected the run time with n simultaneous jobs to be the same as for a single job, as long as n is smaller than the number of cores, which is 6 on my computer. However, there is a dramatic slow-down.
The problem persists when I implement the same task in Python:
import numpy as np
import time
N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
# time.clock() was removed in Python 3.8; process_time() also measures CPU time
begin = time.process_time()
for i in range(1001):
    for j in range(50):
        theta += 0.5*e
end = time.process_time()
print('total time cost is {} s'.format(end-begin))
I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, that is, whether the cache is too small for this many simultaneous jobs. Maybe it is also related to the so-called "false sharing" problem. How can I fix this?
This question is related to an earlier one, "dramatic slow down using multiprocess and numpy in python"; here I just post a simple, typical example.
The code likely gets slower as more copies run at once because the total amount of data that has to flow through the memory buses keeps growing, while their bandwidth is limited and shared by all cores.
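To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The array shape and loop counts come from the question. The traffic model (two reads and one write per element per update, with no cache reuse and no temporaries) is a simplifying assumption of mine, and the helper name working_set_for_jobs is made up for illustration.

```python
# Rough estimate of the memory traffic generated by one job.
# Shape and loop counts are from the question; the traffic model
# (read theta, read e, write theta for every element) is an assumption.

N_TILES = 20
N_TILINGS = 100
MAX_T_STEPS = 50
BYTES_PER_DOUBLE = 8

elements = N_TILES * N_TILINGS * MAX_T_STEPS * 5
array_bytes = elements * BYTES_PER_DOUBLE      # size of test_e or test_theta
working_set_bytes = 2 * array_bytes            # both arrays are touched in every update

updates = 1001 * 50                            # outer loop count * inner loop count
traffic_bytes = updates * 3 * array_bytes      # two reads and one write per update


def working_set_for_jobs(n_jobs):
    """Combined working set in bytes when n_jobs independent copies run at once."""
    return n_jobs * working_set_bytes


print(f"one array:       {array_bytes / 1e6:.1f} MB")
print(f"working set/job: {working_set_bytes / 1e6:.1f} MB")
print(f"traffic per job: {traffic_bytes / 1e9:.1f} GB")
for n in range(1, 6):
    print(f"{n} job(s): combined working set {working_set_for_jobs(n) / 1e6:.0f} MB")
```

Under these assumptions each job works on about 8 MB and moves roughly 600 GB over the whole run. A single job's working set may still fit in the shared last-level cache of some CPUs. With several jobs the combined working set grows linearly and is more likely to spill to main memory, where all jobs compete for the same bandwidth. Whether that is what happens on the machine in the question depends on its cache size, which is not stated.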
If you instead run a single process that works on one array at a time, but enable OpenMP threading, it can be made faster:
On a quad-core CPU: