vectorize a loop which accesses non-consecutive me

2019-05-10 00:38发布

问题:

I have a loop of this structure

Reference : Maxwell Code Example

do z=1,zend
    do y=1,yend
        do x=1,xend
            k=arr(x,y,z)
            do while(k.ne.0)
                ix=fooX(k)
                iy=fooY(k)
                iz=fooZ(k)
                x1=x(ix  ,iy  ,iz)
                x2=x(ix+1,iy  ,iz)
                x3=x(ix  ,iy+1,iz)
                x4=x(ix+1,iy+1,iz)
                x5=x(ix  ,iy  ,iz+1)
                x6=x(ix+1,iy  ,iz+1)
                x7=x(ix  ,iy+1,iz+1)
                x8=x(ix+1,iy+1,iz+1)

                y1=y(ix  ,iy  ,iz)
                y2=y(ix+1,iy  ,iz)
                y3=y(ix  ,iy+1,iz)
                y4=y(ix+1,iy+1,iz)
                y5=y(ix  ,iy  ,iz+1)
                y6=y(ix+1,iy  ,iz+1)
                y7=y(ix  ,iy+1,iz+1)
                y8=y(ix+1,iy+1,iz+1)

                z1=z(ix  ,iy  ,iz)
                z2=z(ix+1,iy  ,iz)
                z3=z(ix  ,iy+1,iz)
                z4=z(ix+1,iy+1,iz)
                z5=z(ix  ,iy  ,iz+1)
                z6=z(ix+1,iy  ,iz+1)
                z7=z(ix  ,iy+1,iz+1)
                z8=z(ix+1,iy+1,iz+1)
                sumX+=x1+x2+..x8
                sumY+=y1+y2+..y8
                sumZ+=z1+z2+..z8

                k=linkArr(k)
            enddo
        enddo
    enddo
enddo

x1 through x8 are the 8 corners of a rectangular cuboid. There are three challenges to vectorize this code. One is that the 8 array elements are not contiguous in memory. Second is the inherent while loop structure along with linked List access. Third the values of ix, iy, iz returned from from fooX, fooY, fooZ are not not contiguous. So each iteration of the loop has a completely different set of ix, iy, iz. So the even across the iterations the memory access is scattered. I tried the following approaches: 1. unrolled the 3-level DO loops as :

do z=1,zend
    do y=1,yend
        do x=1,xend  
           if(arr(x,y,z).NE.0) then
                kArr(indx)=arr(x,y,z)
                DO WHILE (kArr(indx).NE.0)
                  indx = indx + 1
                  kArr(indx)=linkArr(kArr(indx-1))
                ENDDO
            endif
        enddo
    enddo
enddo

With this i have got rid of the while loop structure and now I'm able to run one big loop on kArr inside which i group 8 elements (say my VPU can accomodate 8 sets of data at a time). It did not give a performance improvement. I can post the details of these if anyone is interested. I need suggestions on how to optimize this code. Another option i tried was to combine x,y,z data in a single array so that when i compute x1, y1 & z1 also will be in adjacent memory locations.

回答1:

That while loop is killing you. In a similar situation a few years back, I got a modest improvement in performance doing something like this:

! at top of your code, introduce:
integer :: special_index
integer :: ix(1000), iy(1000), iz(1000)  !promoting scalars to arrays.
                                         ! make as big as possibly needed.

! code as usual until you get to your loops, then

! first, make lookup table
special_index=0
do z=1,zend
  do y=1,yend
    do x=1,xend
      k=arr(x,y,z)
      do while(k.ne.0)
        special_index=special_index+1
        ix(special_index)=fooX(k)
        iy(special_index)=fooY(k)
        iz(special_index)=fooZ(k)
        k=linkArr(k)
      enddo
    enddo
  enddo
endoo
! and now we do the calculation, loop over lookup table:
do n=1,special_index
  x1=x(ix(n)  ,iy(n)  ,iz(n))
  x2=x(ix(n)+1,iy(n)  ,iz(n))
  x3=x(ix(n)  ,iy(n)+1,iz(n))
  etc.
enddo

Like I said, this helped me a few years back. Your mileage may vary. The first loop still won't vectorize, but the second one might, and it might give better performance.