I have a loop of this structure
Reference : Maxwell Code Example
do z=1,zend
do y=1,yend
do x=1,xend
k=arr(x,y,z)
do while(k.ne.0)
ix=fooX(k)
iy=fooY(k)
iz=fooZ(k)
x1=x(ix ,iy ,iz)
x2=x(ix+1,iy ,iz)
x3=x(ix ,iy+1,iz)
x4=x(ix+1,iy+1,iz)
x5=x(ix ,iy ,iz+1)
x6=x(ix+1,iy ,iz+1)
x7=x(ix ,iy+1,iz+1)
x8=x(ix+1,iy+1,iz+1)
y1=y(ix ,iy ,iz)
y2=y(ix+1,iy ,iz)
y3=y(ix ,iy+1,iz)
y4=y(ix+1,iy+1,iz)
y5=y(ix ,iy ,iz+1)
y6=y(ix+1,iy ,iz+1)
y7=y(ix ,iy+1,iz+1)
y8=y(ix+1,iy+1,iz+1)
z1=z(ix ,iy ,iz)
z2=z(ix+1,iy ,iz)
z3=z(ix ,iy+1,iz)
z4=z(ix+1,iy+1,iz)
z5=z(ix ,iy ,iz+1)
z6=z(ix+1,iy ,iz+1)
z7=z(ix ,iy+1,iz+1)
z8=z(ix+1,iy+1,iz+1)
sumX+=x1+x2+..x8
sumY+=y1+y2+..y8
sumZ+=z1+z2+..z8
k=linkArr(k)
enddo
enddo
enddo
enddo
x1 through x8 are the 8 corners of a rectangular cuboid. There are three challenges to vectorize this code. One is that the 8 array elements are not contiguous in memory. Second is the inherent while loop structure along with linked List access. Third the values of ix, iy, iz returned from from fooX, fooY, fooZ are not not contiguous. So each iteration of the loop has a completely different set of ix, iy, iz. So the even across the iterations the memory access is scattered. I tried the following approaches: 1. unrolled the 3-level DO loops as :
do z=1,zend
do y=1,yend
do x=1,xend
if(arr(x,y,z).NE.0) then
kArr(indx)=arr(x,y,z)
DO WHILE (kArr(indx).NE.0)
indx = indx + 1
kArr(indx)=linkArr(kArr(indx-1))
ENDDO
endif
enddo
enddo
enddo
With this i have got rid of the while loop structure and now I'm able to run one big loop on kArr inside which i group 8 elements (say my VPU can accomodate 8 sets of data at a time). It did not give a performance improvement. I can post the details of these if anyone is interested. I need suggestions on how to optimize this code. Another option i tried was to combine x,y,z data in a single array so that when i compute x1, y1 & z1 also will be in adjacent memory locations.