Generic OpenCL stencil kernel and host

Posted 2019-06-22 11:42

Question:

I am new to OpenCL.

I would like to write a generic kernel that I can later extend to other non-coalesced memory access patterns. For simplicity (and to avoid out-of-bounds accesses), I am pairing it with a rectangular stencil pattern for now.

This kernel controls the use of local memory (__local float *lmem).

So far, I have structured my .cl file as below:

__kernel void kmain (
    __global float *in,
    __global float *out,
    __global float *in2,
    __local float *lmem)
{
    int wg_x = get_group_id(0);
    int wg_y = get_group_id(1);
    int wi_x = get_local_id(0);
    int wi_y = get_local_id(1);

    // number of work units each work-item processes
    for (int iter_x = 0; iter_x < NUM_WUS_X-1; iter_x++) {
        for (int iter_y = 0; iter_y < NUM_WUS_Y-1; iter_y++) {
            int wu_x, wu_y;

            // The current work unit coordinate (wu_x, wu_y) is computed based on work group ID (wg_x, wg_y), work item ID (wi_x, wi_y) and work unit ID (iter_x, iter_y) :
            (wu_x, wu_y) = func(wg_x, wg_y,
                          wi_x, wi_y,
                          iter_x, iter_y);

        // This is where to cooperatively load
        // a region of <in> to the local memory.
        // barrier (...);

            for (int i = 0; i < N-1; i++) {
                for (int j = 0; j < M-1; j++) {

                // (fo, fi) determines the home access pattern centered around (idx_o, idx_i). WI(*,*) defines the memory access pattern, i.e. (wi_x) = (wi_y):
                int idx_o = fo(wu_x, wu_y, i, j);
                int idx_i = fi(wu_x, wu_y, i, j);

                // offsets CO's and CI's  determine stencil pattern within each work-item
                ... = in[idx_o + CO_1][idx_i + CI_1];
                ... // context (inner loop body)
                ... = in[idx_o + CO_k][idx_i + CI_k];
                ... // context (inner loop body)
            }
        }
        // barrier (...);
        ... // context (epilogue)
        out[y][x] = ...;
        }
    }
}

Does anyone have any idea how to implement this pattern with its corresponding generic host?

Answer 1:

You can develop a host-side wrapper around the OpenCL bindings such that:

  • it takes a general-purpose code string from the user
  • it generates a kernel at run time from the user's string, reshaped according to the "strategy" you choose
  • the kernel string also defines a custom resource named "scratch_pad"
    • which has a selectable memory type (local/global/register/constant)
    • which is automatically filled with the necessary inputs
    • whose maximum size is limited by the user
    • which has multiple implementations of the [] operator for different access patterns, memory types, and data patterns
  • it binds the relevant host-side buffers to the kernel-side buffers automatically
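
A minimal sketch of the run-time kernel generation step, in plain C with no OpenCL calls, might look like the following. The names pad_space, pad_qualifier and make_kernel_source are hypothetical, and the fixed kmain signature is an assumption made for illustration. Only the local and global options are covered, because a kernel pointer argument cannot live in private (register) memory and a constant pad would be read-only inside the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical selector for where scratch_pad lives (not an OpenCL type). */
typedef enum { PAD_LOCAL, PAD_GLOBAL } pad_space;

/* Map the selector to the OpenCL C address-space qualifier. */
const char *pad_qualifier(pad_space s)
{
    return (s == PAD_LOCAL) ? "__local" : "__global";
}

/* Wrap the user's kernel body in a fixed signature whose scratch_pad
 * argument uses the requested address space.
 * Returns the number of characters written, or -1 if dst is too small. */
int make_kernel_source(char *dst, size_t cap, pad_space space,
                       const char *user_body)
{
    int n = snprintf(dst, cap,
        "__kernel void kmain(__global const float *in,\n"
        "                    __global float *out,\n"
        "                    %s float *scratch_pad)\n"
        "{\n"
        "%s\n"
        "}\n",
        pad_qualifier(space), user_body);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

The resulting string would be handed to clCreateProgramWithSource. For PAD_LOCAL, the host would set the third argument with clSetKernelArg(kernel, 2, bytes, NULL), since local-memory arguments take a size and a NULL pointer; for PAD_GLOBAL it would pass a cl_mem buffer instead.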

Then you can simply change a flag in the parameter list to compare local-memory against global-memory performance, or pass in a different kernel string. However, this won't be easier than simply writing different .cl files, and it looks like too much work if you only have a few different implementations. Debugging also gets harder when you can't see the resulting kernel string.
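
As a lighter-weight variant of the flag idea (a different technique from regenerating the kernel string), the switch can be passed as preprocessor defines in the options string of clBuildProgram, which accepts -D name=value style definitions. The sketch below only formats that string; make_build_options and the USE_LOCAL_PAD macro are hypothetical names, while NUM_WUS_X and NUM_WUS_Y are the constants used in the question's kernel.

```c
#include <stdio.h>

/* Format the options string for clBuildProgram.
 * USE_LOCAL_PAD is a hypothetical macro that the .cl file would test
 * with #if to choose between local and global scratch memory.
 * Returns the number of characters written, or -1 if dst is too small. */
int make_build_options(char *dst, size_t cap,
                       int num_wus_x, int num_wus_y, int use_local)
{
    int n = snprintf(dst, cap,
                     "-DNUM_WUS_X=%d -DNUM_WUS_Y=%d -DUSE_LOCAL_PAD=%d",
                     num_wus_x, num_wus_y, use_local ? 1 : 0);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Printing this string (and the kernel source) to a log just before building keeps the exact configuration visible, which helps with the debugging concern mentioned above.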

Sorry for the late response.