Dramatic slow-down when executing multiple processes

I wrote a very simple code that sums arrays, in both Fortran and Python. When I submit multiple (independent) jobs from the shell, there is a dramatic slow-down as soon as the number of jobs is larger than one.

The Fortran version of my code is as follows:

program main
implicit none
real*8 begin, end, Ht(2, 2), ls(4)
integer i, j, k, ii, jj, kk
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
real*8 rand_val

call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do

call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
call CPU_TIME(end)

write(*, *) 'total time cost is : ', end-begin

end program main

and the shell script is as follows:

#!/bin/bash
gfortran -o result test.f90

nohup ./result &
nohup ./result &
nohup ./result &

As we can see, the main operation is the element-wise sum of the arrays test_theta and test_e. These arrays are not large (about 4 MB each), and my computer has enough memory for this job. My workstation has 6 cores with 12 threads. I tried submitting 1, 2, 3, 4 and 5 jobs at once from the shell, and the time cost is as follows:

| #jobs    | 1  | 2  | 3   | 4   | 5   |
| time (s) | 21 | 31 | 161 | 237 | 357 |

I expected the time for n concurrent jobs to be the same as for a single job as long as the number of jobs is smaller than the number of cores, which is 6 on my computer. However, there is a dramatic slow-down.
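
The scaling experiment above can also be scripted. This is a minimal sketch, assuming a helper named time_concurrent (a hypothetical name, not part of the original shell script) that starts n copies of a command at once and reports the wall-clock time until all of them have exited. The stand-in command is also an assumption, chosen so the sketch runs without the compiled binary.

```python
import subprocess
import sys
import time


def time_concurrent(cmd, n_jobs):
    """Start n_jobs copies of cmd at once; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL) for _ in range(n_jobs)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; replace it with ["./result"]
    # to time the compiled Fortran program from the question.
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    for n in (1, 2, 3):
        print(f"{n} job(s): {time_concurrent(cmd, n):.2f} s")
```

Measuring wall-clock time around the whole batch, rather than CPU time inside each job, shows directly how long it takes for all jobs to finish.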

The problem still exists when I use Python to implement the same task:

import numpy as np 
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

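# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time
begin = time.perf_counter()

for i in range(1001):
    for j in range(50):
        theta += 0.5*e

end = time.perf_counter()
print('total time cost is {} s'.format(end-begin))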

I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, that is, whether the cache is too small for several such jobs at once. Maybe it is also related to the so-called "false sharing" problem. How can I fix this?

This question is related to an earlier one, "dramatic slow down using multiprocess and numpy in python"; here I just post a simple and typical example.

Tags: python performance shell fortran

1 Answer

smile是对你的礼貌

Reply #2 · 2019-08-24 05:11

Running several copies at once is most likely slow because more and more memory must flow through the memory bus, whose bandwidth is limited.
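
To get a feel for the numbers, here is a back-of-the-envelope estimate of the memory traffic generated by the timed loop in the question. It is only a sketch under a stated assumption: every whole-array update reads test_theta and test_e and writes test_theta once, with nothing served from cache. The 21 s figure is the single-job time from the question's table.

```python
# Back-of-the-envelope memory traffic for the timed loop in the question.
N_TILES, N_TILINGS, MAX_T_STEPS = 20, 100, 50
BYTES_PER_DOUBLE = 8

elements = N_TILES * N_TILINGS * MAX_T_STEPS * 5   # 500,000 elements per array
array_bytes = elements * BYTES_PER_DOUBLE          # 4,000,000 bytes per array
updates = 1001 * 50                                # 50,050 whole-array updates

# Assumption: each update streams two arrays in and one array out,
# and none of that traffic is absorbed by the caches.
bytes_per_update = 3 * array_bytes
total_bytes = updates * bytes_per_update

single_job_seconds = 21.0  # single-job time from the question's table
print(f"traffic per job: {total_bytes / 1e9:.1f} GB")
print(f"implied rate for 1 job: {total_bytes / 1e9 / single_job_seconds:.1f} GB/s")
```

Under that assumption a single job moves about 600 GB, so one job alone would already need a sustained rate of tens of GB/s, and every additional job asks for the same amount again.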

If you instead run just one process, which works with one pair of arrays at a time, and enable OpenMP threading, it can be made faster:

integer*8 :: begin, end, rate
...

call system_clock(count_rate=rate)
call system_clock(count=begin)

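! Caution: each thread updates the whole shared array test_theta, so the
! i iterations race with one another. The timing is indicative, but the
! final values of test_theta are not reliable.
!$omp parallel do
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do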

call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001
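
For the NumPy version in the question, a complementary way to ease pressure on the memory bus is to move less data per update. This is a minimal sketch, assuming the scaled array 0.5*e does not change inside the loop, so it can be built once outside it. The name half_e is hypothetical, and the update count is reduced so the sketch runs quickly.

```python
import numpy as np

N_tiles, N_tilings, max_t_steps = 20, 100, 50
shape = (N_tiles * N_tilings, max_t_steps, 5)
n_updates = 20  # reduced from 1001*50 so the sketch runs quickly

e = np.ones(shape, dtype=np.float64)

# Original pattern: 0.5*e allocates and fills a temporary array on every update.
theta_a = np.ones(shape, dtype=np.float64)
for _ in range(n_updates):
    theta_a += 0.5 * e

# Hoisted pattern: half_e (a hypothetical name) is built once, so each update
# only reads half_e and reads and writes theta_b.
theta_b = np.ones(shape, dtype=np.float64)
half_e = 0.5 * e
for _ in range(n_updates):
    theta_b += half_e

print(np.allclose(theta_a, theta_b))
```

Both loops compute the same values. Whether the hoisted form is noticeably faster depends on the machine, so it is worth timing both before relying on it.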
