可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Hopefully you could me explaining this thing. I'm working with Fortran and actually writing my code on CFD topic, and below (just for sake of simplicity and just for example) are short explanations of my case:

I should use one 2D array A(i,j) and one 1D array B(i)
I have to do 2 times looping, which is the first looping should be 50,000 times and the second one is 5 times (can't be changed).
The point number 2 above should be looped 10,000 times.

I write the codes with 2 versions (I called them Prog_A and Prog_B).

The first one is as shown below:

PROGRAM PROG_A
 REAL*8, DIMENSION(50000,5):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO I=1, 50000
              A(I,1)=1.0
              A(I,2)=2.0
              A(I,3)=3.0
              A(I,4)=4.0
              A(I,5)=5.0

              B(I)=I 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO I=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO J=1, 5
                           A(I,J)=A(I,J)+SQRT(B(I))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_A

The second one is:

PROGRAM PROG_B
 REAL*8, DIMENSION(5,50000):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO J=1, 50000
              A(1,J)=1.0
              A(2,J)=2.0
              A(3,J)=3.0
              A(4,J)=4.0
              A(5,J)=5.0

              B(J)=J 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO J=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO I=1, 5
                           A(I,J)=A(I,J)+SQRT(B(J))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_B

As you can see the different is for the first one I used 2D array A(50000,5) and for the second one I used 2D array A(5,50000).

To my knowledge, since Fortran is based on "column major", so the second case would be faster than the first one, since I performed (in the second one) the looping for the most inner side of array (in this case i=1, ..., 5).

But after compiled on gfortran (with -O3 optimization), I've found that the second one is even much slower than the first one. Here is the result:

First case : elapsed time = 29.187 s
Second case : elapsed time = 70.496 s

Could anyone explain me why?

PS: The results of both cases are same for sure.

回答1:

A part of the optimization is based on heuristics chosen by the implementation. You are not going to get a totally deterministic explanation. When writing programs in languages that allows full array operations like Fortran, the main point is: if you can express something as a full array / or vector operation, do it and let the compiler generates the loops for you. That way, the only thing to do is to test with your compiler the most efficient and stick with it.

And remember, it is compiler dependent so it will change with the compiler. For example, if I take your programs as is (changing the number of k iterations from 100000 to 10000 to speed it, and averaging 5 runs for each), this is the timing on my computer with different compilers. I don't have the time to check with new versions of compiler for you.

pgf90 14
A: 5.05 s
B: 0.18 s

gfortran 4.9
A: 1.05 s
B: 4.02 s

ifort 13
A: 2.01 s
B: 1.08 s

You can see that where gfortran tells you that B is bad, pgfortran tells the opposite and totally blows the results from A

Now if I vectorize to let the compiler do the job. Here I modify only A and eliminate the I loop to have this:

DO K=1, 10000
    DO J=1, 5
        A(I,J)=A(I,J)+SQRT(B(I))
    END DO
END DO

Then we get this (only program A)

pgf90 14: 5.05 s

gfortran 4.9: 5.04

ifort 13: 2.02 s

pgfortran and ifort are stable while gfortran is exploiting a trick in the first case, possibly the suggestion of haraldkl (see the factor 5). When we vectorize, the trick is not obvious gfortran does not perform well. It seems that ifort and pgfortran simply rewrite the loop for you to have the right ordering.

And if I get smarter and elimite the K-loop too

    DO J=1, 5
        A(:,J)=A(:,J)+10000*SQRT(B(:)) ! this seems to be final result
    END DO

Then we get this (only program A)

pgf90 14: 0.001

gfortran 4.9: 0.001

ifort 13: 0.001 s

All the compilers become equivalent because there is almost nothing to optimize. You see that you can optimize everything be simply using array operations.

Update High Performance Mark pointed out in comment that the compiler might actually skip all the computation if it found that the result is not used, which might happen with some implementations. The results presented in this answer accounted for that possibility, even though I did not mention it in the first place. To prevent the compiler from skipping the code entirely, I printed the sum of all the elements of the result after the computation (after the timing); The result is identical to the 3rd digit after the decimal, which is good enough for a result of ~372684326034.9146 (~10^12). This is enough to ensure that the compiler does the actual computation. I totally forgot to mention it in the answer.

回答2:

The compiler probably does something like this:

    DO K=1, 100000
       DO I=1, 50000
          tmp = sqrt(b(i))
          A(I,1) = A(I,1) + tmp
          A(I,2) = A(I,2) + tmp
          A(I,3) = A(I,3) + tmp
          A(I,4) = A(I,4) + tmp
          A(I,5) = A(I,5) + tmp
       END DO
    END DO

In Prog_A this gives you a nice access pattern with a stride of 1. If you change the order of the indices as in Prog_B, you will get a stride of 5 for this code. The effect of this is machine dependent, but is definitely worse than the simple stride-1 access.