I hope someone can help explain this. I'm working in Fortran on a CFD code, and the simplified example below describes my case:
- I need to use one 2D array A(i,j) and one 1D array B(i).
- I have to run two nested loops: the first runs 50,000 times and the second runs 5 times (these sizes can't be changed).
- The nested loops in the second point above are themselves repeated 100,000 times (the K loop in the code below).
I wrote two versions of the code (I called them Prog_A and Prog_B).
The first one is shown below:
PROGRAM PROG_A
  IMPLICIT NONE
  REAL*8, DIMENSION(50000,5) :: A
  REAL*8, DIMENSION(50000) :: B
  REAL*8 :: TIME1, TIME2
  INTEGER :: I, J, K
  ! Initial values
  DO I = 1, 50000
    A(I,1) = 1.0
    A(I,2) = 2.0
    A(I,3) = 3.0
    A(I,4) = 4.0
    A(I,5) = 5.0
    B(I) = I
  END DO
  ! Start of timed region (CPU_TIME reports processor time)
  CALL CPU_TIME(TIME1)
  DO K = 1, 100000
    DO I = 1, 50000  ! Array should be computed first for 50,000 elements (can't be changed)
      DO J = 1, 5
        A(I,J) = A(I,J) + SQRT(B(I))
      END DO
    END DO
  END DO
  ! End of timed region
  CALL CPU_TIME(TIME2)
  PRINT *, 'Elapsed CPU time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_A
The second one is:
PROGRAM PROG_B
  IMPLICIT NONE
  REAL*8, DIMENSION(5,50000) :: A
  REAL*8, DIMENSION(50000) :: B
  REAL*8 :: TIME1, TIME2
  INTEGER :: I, J, K
  ! Initial values
  DO J = 1, 50000
    A(1,J) = 1.0
    A(2,J) = 2.0
    A(3,J) = 3.0
    A(4,J) = 4.0
    A(5,J) = 5.0
    B(J) = J
  END DO
  ! Start of timed region (CPU_TIME reports processor time)
  CALL CPU_TIME(TIME1)
  DO K = 1, 100000
    DO J = 1, 50000  ! Array should be computed first for 50,000 elements (can't be changed)
      DO I = 1, 5
        A(I,J) = A(I,J) + SQRT(B(J))
      END DO
    END DO
  END DO
  ! End of timed region
  CALL CPU_TIME(TIME2)
  PRINT *, 'Elapsed CPU time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_B
As you can see, the difference is that the first version uses the 2D array A(50000,5) and the second uses A(5,50000).
To my knowledge, since Fortran stores arrays in column-major order, the second version should be faster than the first, because its innermost loop (i = 1, ..., 5) runs over the first index of the array, whose elements are contiguous in memory.
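To make the column-major assumption concrete, here is a minimal sketch of how Fortran lays out a 2D array. The sizes N1 and N2 are hypothetical and tiny, chosen only so the output is readable; they are not the sizes of my real arrays.

```fortran
PROGRAM LAYOUT_DEMO
  IMPLICIT NONE
  ! Hypothetical small sizes, for illustration only
  INTEGER, PARAMETER :: N1 = 2, N2 = 3
  INTEGER :: A(N1,N2)
  INTEGER :: I, J

  ! RESHAPE fills A in storage order, so each element ends up
  ! holding its own 1-based position in memory
  A = RESHAPE([(I, I = 1, N1*N2)], [N1, N2])

  DO J = 1, N2
    DO I = 1, N1
      ! In column-major order, the position of A(I,J) is I + (J-1)*N1
      PRINT '(A,I0,A,I0,A,I0,A,I0)', 'A(', I, ',', J, ') = ', A(I,J), &
            '   I+(J-1)*N1 = ', I + (J-1)*N1
    END DO
  END DO
END PROGRAM LAYOUT_DEMO
```

By the same formula, A(I,J) and A(I,J+1) are 50,000 elements apart in A(50000,5), while in A(5,50000) the five elements A(1:5,J) sit next to each other.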
But after compiling with gfortran (with -O3 optimization), I found that the second version is much slower than the first. Here are the results:
- First case : elapsed time = 29.187 s
- Second case : elapsed time = 70.496 s
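One caveat on these numbers: CPU_TIME reports processor time, not wall-clock time. A minimal sketch of how both could be measured side by side with SYSTEM_CLOCK, using the A(5,50000) layout. NREP is a hypothetical, reduced repetition count so the sketch finishes quickly, and the checksum is printed only so that the compiler cannot discard the timed loop.

```fortran
PROGRAM TIMING_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)
  INTEGER, PARAMETER :: I8 = SELECTED_INT_KIND(18)
  INTEGER, PARAMETER :: N = 50000, M = 5
  INTEGER, PARAMETER :: NREP = 100   ! hypothetical, reduced repetition count
  REAL(DP) :: A(M,N), B(N)
  REAL(DP) :: CPU1, CPU2
  INTEGER(I8) :: C1, C2, RATE
  INTEGER :: I, J, K

  ! Same initial values as in the original programs
  DO J = 1, N
    DO I = 1, M
      A(I,J) = REAL(I, DP)
    END DO
    B(J) = REAL(J, DP)
  END DO

  CALL SYSTEM_CLOCK(C1, RATE)   ! wall-clock ticks and ticks per second
  CALL CPU_TIME(CPU1)           ! processor time in seconds
  DO K = 1, NREP
    DO J = 1, N
      DO I = 1, M
        A(I,J) = A(I,J) + SQRT(B(J))
      END DO
    END DO
  END DO
  CALL CPU_TIME(CPU2)
  CALL SYSTEM_CLOCK(C2)

  PRINT *, 'CPU time  = ', CPU2 - CPU1, ' s'
  PRINT *, 'Wall time = ', REAL(C2 - C1, DP) / REAL(RATE, DP), ' s'
  PRINT *, 'Checksum  = ', SUM(A)   ! uses the result so the loop is kept
END PROGRAM TIMING_SKETCH
```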
Could anyone explain why?
PS: Both versions produce the same results.
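For reference, a sketch of how that equality can be checked: both layouts are updated in the same loop and then compared through TRANSPOSE. The sizes N and NREP are hypothetical and reduced so the check runs quickly; the names A1 and A2 are only for this illustration.

```fortran
PROGRAM CHECK_SAME
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)
  ! Hypothetical reduced sizes, for a quick check only
  INTEGER, PARAMETER :: N = 1000, M = 5, NREP = 10
  REAL(DP) :: A1(N,M), A2(M,N), B(N)
  INTEGER :: I, J, K

  ! Same initial values in both layouts
  DO I = 1, N
    DO J = 1, M
      A1(I,J) = REAL(J, DP)
      A2(J,I) = REAL(J, DP)
    END DO
    B(I) = REAL(I, DP)
  END DO

  DO K = 1, NREP
    DO I = 1, N
      DO J = 1, M
        A1(I,J) = A1(I,J) + SQRT(B(I))   ! layout used in Prog_A
        A2(J,I) = A2(J,I) + SQRT(B(I))   ! layout used in Prog_B
      END DO
    END DO
  END DO

  ! Largest element-wise difference between the two layouts
  PRINT *, 'Max abs difference = ', MAXVAL(ABS(A1 - TRANSPOSE(A2)))
END PROGRAM CHECK_SAME
```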