Dramatic slow-down when executing multiple process

2019-08-24 05:14发布

问题:

I write a very simple code which contains summation of arrays by using both Fortran and Python. When I submit multiple (independent) jobs using shell, there will be dramatic slow-down when the number of threads is larger than one.

The Fortran version of my code is presented as follows

program main
implicit none
real*8 begin, end, Ht(2, 2), ls(4)
integer i, j, k, ii, jj, kk
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
real*8 rand_val

call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do

call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
call CPU_TIME(end)

write(*, *) 'total time cost is : ', end-begin

end program main

and a shell-scipt is presented as follows

#!/bin/bash
gfortran -o result test.f90

nohup ./result &
nohup ./result &
nohup ./result &

As we can see, the main operation is the summation of array test_theta and test_e. These arrays are not large (3MB approximately) and the memory space of my computer is enough for this job. My work station has 6 cores with 12 threads. I try to submit 1, 2, 3, 4 and 5 jobs by using shell at one time, and the cost of time is presented as follows

| #jobs   |  1   |   2   |   3    |  4    |  5   |
| time(s) |  21  |   31  |   161  |  237  |  357 | 

I expect that the time for n-thread job should be the same as the single-thread job once the number of threads is smaller than the number of cores we have, which is 6 here for my computer. However, we find dramatic slow-down here.

This problem still exists when I use Python to implement the same task

import numpy as np 
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

begin = time.clock()

for i in range(1001):
    for j in range(50):
        theta += 0.5*e

end = time.clock()
print('total time cost is {} s'.format(end-begin))

I don't know the reason and I wonder whether it is related to the size of L3 cache of CPU. That is, cache is too small for such multi-thread job. Maybe it is also related to the so-called "false sharing" problem. How can I fix this ?

This question is related to a former one dramatic slow down using multiprocess and numpy in python and here I just post a simple and typical example.

回答1:

The code is likely slow when running multiple times, because you have more and more memory that must flow through the limited bandwidth memory buses.

If you run just one process, that works just with one array at one time, but enable OpenMP threading, it can be made faster:

integer*8 :: begin, end, rate
...

call system_clock(count_rate=rate)
call system_clock(count=begin)

!$omp parallel do
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do

call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001