Can this OpenCL code be optimized?

I am working on a piece of OpencL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B and a constant c, return 1xD vector r where r[i] = c * sum_over_j (v[j] * A[i][j] * B[i][j])

Below is what I have so far, but it runs freakishly slow. A version without summing that returns a DxD matrix is about ten times faster. It's called from PyOpenCL if that makes any difference.

Is anything done wrong? Could it be optimized?

#define D 1000
...

   __kernel void element_mult(
      __global float *result,
      __global const float *vector,
      __global const float *matrix,
      __global const float *matrix2,
        const float factor)
      {
         int y = get_global_id(1);
         float sum = 0;
         for(int k = 0; k < D; k++)
         {
            sum += vector[k] * matrix[(y*D) + k]
            * matrix2[(y*D) + k ];
         }
         result[y] = sum * factor;
      }

Cheers!

Optimization #1: make vector __local.

My first pass at this got a decent improvement in performance. I noticed that each vector[k] is read a total of D times, so I copied it to a __local. This is only possible because D is small enough to allow this. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 gpus. Even the slower gpus are still waiting on the memory access.

   #define D 1000
    __kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
    {
        int y = get_global_id(0);
        float sum = 0;

        __local float vectCopy[D];
        int ls = get_local_size(0);
        int lid = get_local_id(0);
        for(int i=0;i<D;i+=ls){
            vectCopy[i+lid] = vector[i+lid];
        }
        mem_fence(CLK_LOCAL_MEM_FENCE);

        for(int k = 0; k < D; k++)
        {
            sum += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
        }
        result[y] = sum * factor;
    }

With this change, APP profiler is showing a new ALU:fetch ratio of 0.20 for the 5870 and 6970 gpus. Average times changed from 1513-->1034, and 1261-->861 on the same cards. The low end gpus are now bound by ALU instead of fetch. (greater than 4:1 ratio)

Opimization #2: calculate each result[y] using an entire work group.

You would have to do this id D were much larger (100k+). The idea is to get the best memory access pattern by using the work group to compute a single element of the result at a time. I defined ls (local size) to be 64 here, because it works on my hardware, as well as most vendors'. The workgroup size you use from the host-side will have to be 64 unless you change that definition. It needs to be defined to create the sum[ls] storage as __local, and I don't like passing variable sized __local vars into my kernels.

results: 5870 ALU:fetch=0.59:1, avg=708. 6970 ALU:fetch=0.72, avg=590. According to APP profiler, this is about twice as fast as your original listing.

#define D 1000
#define ls 64
__kernel void element_mult(
__global float *result,
__global const float *vector,
__global const float *matrix,
__global const float *matrix2,
const float factor)
{
    __local float vectCopy[D];
    int lid = get_local_id(0);
    for(int i=0;i<D;i+=ls){
        vectCopy[i+lid] = vector[i+lid];
    }
    mem_fence(CLK_LOCAL_MEM_FENCE);

    int ng = get_num_groups(0);
    int gid = get_group_id(0);
    int y, k;
    __local float sum[ls];
    for(y = gid; y < D; y+=ng){
        for(k = lid; k < D; k+=ls)
        {
            sum[lid] += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
        }
        if(lid==0){
            result[y] = sum[0];
            for(k=1;k<ls;k++){
                result[y] += sum[k];
            }
            result[y] *= factor;
        }
        mem_fence(CLK_LOCAL_MEM_FENCE);
    }
}

EDIT: APP profiler = AMD APP KernelAnalyzer