Dramatic slow-down when executing multiple processes at the same time

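I wrote a very simple program, in both Fortran and Python, that repeatedly sums two arrays. When I submit several independent jobs from the shell, there is a dramatic slow-down as soon as the number of jobs (threads) is greater than one.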

The Fortran version of my code is shown below:

program main
implicit none
real*8 begin, end, Ht(2, 2), ls(4)
integer i, j, k, ii, jj, kk
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
real*8 rand_val

call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do

call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
call CPU_TIME(end)

write(*, *) 'total time cost is : ', end-begin

end program main

and the shell script is shown below:

#!/bin/bash
gfortran -o result test.f90

nohup ./result &
nohup ./result &
nohup ./result &
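The shell script above starts the jobs but does not time them as a group. A minimal Python sketch of the same launcher, which also records the wall-clock time until the last job finishes, might look like the following. The helper names `run_concurrently` and `sweep` are hypothetical, and the sketch assumes the compiled binary is available as `./result`.

```python
import subprocess
import sys
import time


def run_concurrently(cmd, n_jobs):
    """Start n_jobs copies of cmd at once and return the wall-clock
    seconds until the last one has finished."""
    start = time.perf_counter()
    # The jobs' own output is discarded; only the overall duration is kept.
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
             for _ in range(n_jobs)]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def sweep(cmd, max_jobs):
    """Return {n_jobs: seconds} for 1..max_jobs concurrent copies of cmd."""
    return {n: run_concurrently(cmd, n) for n in range(1, max_jobs + 1)}
```

Calling `sweep(["./result"], 5)` would repeat the experiment for 1 to 5 concurrent jobs. Wall-clock time is used here because it is what a user waiting for the batch actually experiences.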

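As we can see, the main operation is the sum of the arrays test_theta and test_e. These arrays are not large (about 4 MB each), and my computer has more than enough memory for this job. My workstation has 6 cores and 12 threads. I tried submitting 1, 2, 3, 4 and 5 jobs at once from the shell, and the time costs are as follows: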

| #jobs   |  1   |   2   |   3    |  4    |  5   |
| time(s) |  21  |   31  |   161  |  237  |  357 |

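I expected the time for n concurrent jobs to be the same as for a single job, as long as the number of jobs is smaller than the number of cores, which is 6 on my machine. Instead, we find a dramatic slow-down.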
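To put numbers on the slow-down, here is a small sketch that derives two ratios from the table. It assumes the reported time is the run time of each job when n jobs run concurrently, which the post implies but does not state outright. The function names are illustrative.

```python
# Timings copied from the table: number of concurrent jobs -> seconds.
times = {1: 21, 2: 31, 3: 161, 4: 237, 5: 357}


def slowdown(times):
    """Run time with n concurrent jobs divided by the single-job run time."""
    base = times[1]
    return {n: t / base for n, t in times.items()}


def relative_throughput(times):
    """Jobs completed per second, relative to running one job alone.
    A value below 1 means running the jobs one after another is faster."""
    base = times[1]
    return {n: n * base / t for n, t in times.items()}


for n in sorted(times):
    print(n, round(slowdown(times)[n], 2), round(relative_throughput(times)[n], 2))
```

Under that assumption, two concurrent jobs still finish more work per second than one, but from three jobs onward the relative throughput drops below 1, so the batch would finish sooner if the jobs ran sequentially.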

The problem persists when I implement the same task in Python:

import numpy as np 
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

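# time.clock() was removed in Python 3.8; process_time() also reports CPU time
begin = time.process_time()

# repeat the element-wise update 1001 * 50 times
for i in range(1001):
    for j in range(50):
        theta += 0.5*e

end = time.process_time()
print('total time cost is {} s'.format(end-begin))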

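I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache; that is, the cache may be too small for this many concurrent jobs. Perhaps it is also related to the so-called "false sharing" problem. How can I fix this?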

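This question is related to a previous one, "Dramatic slow down using multiprocess and numpy in python"; here I am just posting a simple and typical example.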

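The code is likely slow when several copies run at once because more and more data has to flow through the memory bus, which has limited bandwidth.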
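A rough back-of-the-envelope sketch of how much data is involved, using the array dimensions from the Fortran program. The traffic model (each update reads both arrays and writes one back, with no reuse from cache) is a simplifying assumption, not a measurement.

```python
# Array dimensions taken from the Fortran program.
N_TILES, N_TILINGS, MAX_T_STEPS, DEPTH = 20, 100, 50, 5
BYTES_PER_REAL8 = 8

elements = N_TILES * N_TILINGS * MAX_T_STEPS * DEPTH
bytes_per_array = elements * BYTES_PER_REAL8

# Each process works on two arrays: test_theta and test_e.
working_set = 2 * bytes_per_array

# Assumed model: every update reads test_theta and test_e and writes test_theta.
updates = 1001 * 50
traffic_per_update = 3 * bytes_per_array
total_traffic = updates * traffic_per_update


def combined_working_set(n_jobs):
    """Bytes touched repeatedly when n_jobs independent copies run at once."""
    return n_jobs * working_set


print("bytes per array:", bytes_per_array)
print("working set per process:", working_set)
print("estimated traffic per run:", total_traffic)
for n in range(1, 6):
    print(n, "jobs ->", combined_working_set(n), "bytes")
```

Each array is about 4 MB (3.8 MiB), so one process repeatedly touches about 8 MB and every extra job adds another 8 MB. Whether the combined working set still fits in the last-level cache depends on the CPU, whose cache size the post does not give. Once it no longer fits, all jobs compete for the same memory bus.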

If you run just one process, which works on only one set of arrays at a time, but enable OpenMP threading, it can go faster:

integer*8 :: begin, end, rate
...

call system_clock(count_rate=rate)
call system_clock(count=begin)

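! Caveat: test_theta is shared, so iterations of i that run on different
! threads update the same array elements at the same time.
!$omp parallel do
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do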

call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001
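For reference, the speedup implied by the two timings above can be worked out as follows. The sketch assumes the OpenMP run used four threads, one per core; the output above does not state the thread count.

```python
# Wall-clock timings from the two runs above, in seconds.
serial_s = 15.135917384
openmp_s = 3.946444183

# Assumption: one OpenMP thread per core on the quad-core CPU.
cores = 4

speedup = serial_s / openmp_s
efficiency = speedup / cores

print("speedup: {:.2f}x".format(speedup))
print("parallel efficiency: {:.0%}".format(efficiency))
```

Under that assumption the run scales almost linearly with the number of cores.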