I wrote a very simple program, in both Fortran and Python, whose main work is summing two arrays. When I submit multiple independent copies of it as jobs from a shell, there is a dramatic slow-down as soon as more than one job runs at the same time.
The Fortran version of my code is presented as follows
```fortran
program main
    implicit none
    real*8 begin, end, Ht(2, 2), ls(4)
    integer i, j, k, ii, jj, kk
    integer, parameter :: N_tiles = 20
    integer, parameter :: N_tilings = 100
    integer, parameter :: max_t_steps = 50
    real*8, dimension(N_tiles*N_tilings, max_t_steps, 5) :: test_e, test_theta
    real*8 rand_val

    ! fill both arrays with uniform random numbers
    call random_seed()
    do i = 1, N_tiles*N_tilings
        do j = 1, max_t_steps
            do k = 1, 5
                call random_number(rand_val)
                test_e(i, j, k) = rand_val
                call random_number(rand_val)
                test_theta(i, j, k) = rand_val
            end do
        end do
    end do

    ! time 1001 * 50 whole-array updates
    call CPU_TIME(begin)
    do i = 1, 1001
        do j = 1, 50
            test_theta = test_theta + 0.5d0*test_e
        end do
    end do
    call CPU_TIME(end)
    write(*, *) 'total time cost is : ', end - begin
end program main
```
and a shell script is presented as follows:
```bash
#!/bin/bash
gfortran -o result test.f90
# launch three independent copies in the background
nohup ./result &
nohup ./result &
nohup ./result &
```
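For reference, here is a sketch of a small driver that launches the same batch and measures the wall-clock time until all copies finish (the binary name `./result` comes from the script above; everything else is hypothetical):

```python
import subprocess
import sys
import time

# Launch n copies of the compiled binary and measure the wall-clock
# time until the last one finishes.
n = int(sys.argv[1]) if len(sys.argv) > 1 else 3

start = time.perf_counter()
jobs = [subprocess.Popen(['./result']) for _ in range(n)]
for job in jobs:
    job.wait()
print('{} jobs finished in {:.1f} s'.format(n, time.perf_counter() - start))
```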
The main operation is the in-place summation of the arrays `test_theta` and `test_e`. These arrays are not large (about 4 MB each: 2000 × 50 × 5 double-precision values), and my machine has plenty of memory for several such jobs. The workstation has 6 physical cores (12 hardware threads). I submitted 1, 2, 3, 4 and 5 jobs at a time from the shell, with the following timings:
| #jobs | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| time (s) | 21 | 31 | 161 | 237 | 357 |
I expected that, as long as the number of simultaneous jobs is smaller than the number of physical cores (6 on my machine), each job would take about the same time as a single job running alone. Instead, there is a dramatic slow-down.
The problem persists when I implement the same task in Python:
```python
import numpy as np
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50

theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

# time.clock() was removed in Python 3.8; perf_counter() is a wall-clock timer
begin = time.perf_counter()
for i in range(1001):
    for j in range(50):
        theta += 0.5*e
end = time.perf_counter()
print('total time cost is {} s'.format(end - begin))
```
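Since every update sweeps both whole arrays, the timed loop moves a lot of data. A back-of-the-envelope estimate (a sketch only; it counts just the minimal read/write traffic per update, and the 21 s figure is my single-job time from the table above):

```python
# Rough memory-traffic estimate for the timed loop above.
array_bytes = 2000 * 50 * 5 * 8          # one array: 4,000,000 bytes
per_update = 3 * array_bytes             # read theta, read e, write theta
updates = 1001 * 50
total_gb = updates * per_update / 1e9
print('memory traffic ~ {:.0f} GB'.format(total_gb))                 # ~600 GB
print('implied bandwidth at 21 s ~ {:.0f} GB/s'.format(total_gb / 21))
```

Roughly 29 GB/s for a single job already seems to be within a small factor of a typical desktop's total DRAM bandwidth, so several concurrent jobs could plausibly be competing for it.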
I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, i.e. whether the shared cache is too small for several of these jobs at once. It may also be related to the so-called "false sharing" problem. How can I fix this?
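To make the cache hypothesis concrete, here is a minimal sketch that estimates the combined working set of n concurrent jobs (the array dimensions are the ones from my code above):

```python
import numpy as np

N_tiles, N_tilings, max_t_steps = 20, 100, 50
arr = np.ones((N_tiles * N_tilings, max_t_steps, 5), dtype=np.float64)

# Each job keeps two such arrays (test_theta and test_e) hot in cache.
per_job = 2 * arr.nbytes  # 2 * 4,000,000 bytes = 8 MB per job
for n_jobs in range(1, 6):
    print('{} job(s): combined working set ~ {:.0f} MB'
          .format(n_jobs, n_jobs * per_job / 1e6))
```

Comparing these numbers against the actual L3 size of the CPU (visible in `lscpu`, for example) shows at which job count the combined arrays stop fitting in cache; as far as I know, 6-core desktop CPUs typically have on the order of 10-32 MB of shared L3 depending on the model.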
This question is related to an earlier one, dramatic slow down using multiprocess and numpy in python; here I just post a simpler, more typical example.