I am new to OpenCL.
I would like to write a generic kernel so that I can later extend it to other non-coalesced memory access patterns. For now I am pairing it with a rectangular stencil pattern
for simplicity (this also avoids out-of-bounds accesses).
This kernel controls the use of local memory (__local float *lmem).
As of now, I have structured my .cl
file as below:
__kernel void kmain(
    __global float *in,
    __global float *out,
    __global float *in2,
    __local float *lmem)
{
    int wg_x = get_group_id(0);
    int wg_y = get_group_id(1);
    int wi_x = get_local_id(0);
    int wi_y = get_local_id(1);
    // number of work units each work-item processes
    for (int iter_x = 0; iter_x < NUM_WUS_X-1; iter_x++) {
        for (int iter_y = 0; iter_y < NUM_WUS_Y-1; iter_y++) {
            int wu_x, wu_y;
            // The current work unit coordinate (wu_x, wu_y) is computed from the work group ID (wg_x, wg_y), the work item ID (wi_x, wi_y) and the work unit ID (iter_x, iter_y):
            (wu_x, wu_y) = func(wg_x, wg_y,
                                wi_x, wi_y,
                                iter_x, iter_y);
            // This is where to cooperatively load
            // a region of <in> to the local memory.
            // barrier (...);
            for (int i = 0; i < N-1; i++) {
                for (int j = 0; j < M-1; j++) {
                    // (fo, fi) determines the home access pattern centered around (idx_o, idx_i). WI(*,*) defines the memory access pattern, i.e. (wi_x) = (wi_y):
                    int idx_o = fo(wu_x, wu_y, i, j);
                    int idx_i = fi(wu_x, wu_y, i, j);
                    // offsets CO's and CI's determine the stencil pattern within each work-item
                    ... = in[idx_o + CO_1][idx_i + CI_1];
                    ... // context (inner loop body)
                    ... = in[idx_o + CO_k][idx_i + CI_k];
                    ... // context (inner loop body)
                }
            }
            // barrier (...);
            ... // context (epilogue)
            out[y][x] = ...;
        }
    }
}
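To make the role of func more concrete, here is a minimal sketch in plain C of one possible mapping. It assumes a blocked layout in which each work-item owns a contiguous NUM_WUS_X x NUM_WUS_Y tile of work units. The names coord2 and work_unit_coord are hypothetical, and lsz_x and lsz_y stand in for get_local_size(0) and get_local_size(1). Other mappings are equally valid.

```c
#include <assert.h>

/* Hypothetical 2D coordinate pair; not an OpenCL type. */
typedef struct { int x; int y; } coord2;

/* One possible `func` (an assumption, not the only choice):
   a blocked mapping where each work-item processes a contiguous
   num_wus_x x num_wus_y tile of work units.
   lsz_x, lsz_y play the role of get_local_size(0) and get_local_size(1). */
static coord2 work_unit_coord(int wg_x, int wg_y,
                              int wi_x, int wi_y,
                              int iter_x, int iter_y,
                              int lsz_x, int lsz_y,
                              int num_wus_x, int num_wus_y)
{
    /* Local IDs and work-unit IDs must lie inside their ranges. */
    assert(wi_x >= 0 && wi_x < lsz_x && wi_y >= 0 && wi_y < lsz_y);
    assert(iter_x >= 0 && iter_x < num_wus_x);
    assert(iter_y >= 0 && iter_y < num_wus_y);

    coord2 wu;
    /* Global work-item index, scaled by the tile size, plus the offset
       of the current work unit inside the tile. */
    wu.x = (wg_x * lsz_x + wi_x) * num_wus_x + iter_x;
    wu.y = (wg_y * lsz_y + wi_y) * num_wus_y + iter_y;
    return wu;
}
```

With this blocked mapping, neighbouring work-items touch x coordinates that are num_wus_x apart in the same iteration, so it is a natural starting point for experimenting with non-coalesced accesses. Swapping the function body is enough to try a different pattern.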
Does anyone have any idea how to implement this pattern together with its corresponding generic host code?
You can develop a host-side encapsulation over the OpenCL bindings
that assembles the kernel for different access patterns, memory types and data patterns. Then you can simply change a flag in the parameter list to compare local vs. global memory performance, or give it a different kernel string. But this won't be easier than simply writing different .cl files. It looks like too much work if you have just a few different implementations. Debugging also gets harder when you don't know what the resulting kernel string is.
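As a rough illustration of the "change a flag" idea, here is a minimal host-side sketch in C that only formats a compiler options string. The macro names NUM_WUS_X, NUM_WUS_Y, N and M come from the question; USE_LOCAL, kernel_config and format_build_options are hypothetical names introduced for this example. The resulting string would be passed as the options argument of clBuildProgram, which accepts -D name=definition, and the .cl file would then branch on #if USE_LOCAL.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical host-side description of one kernel variant. */
typedef struct {
    int num_wus_x;  /* work units per work-item in x */
    int num_wus_y;  /* work units per work-item in y */
    int n;          /* inner loop bound N */
    int m;          /* inner loop bound M */
    int use_local;  /* 1: stage data in __local memory, 0: read __global directly */
} kernel_config;

/* Writes the preprocessor definitions for one variant into buf.
   Returns the number of characters written (excluding the terminator),
   or -1 if the buffer is too small or formatting fails. */
static int format_build_options(const kernel_config *cfg, char *buf, size_t len)
{
    assert(cfg != NULL && buf != NULL && len > 0);
    int written = snprintf(buf, len,
                           "-D NUM_WUS_X=%d -D NUM_WUS_Y=%d -D N=%d -D M=%d -D USE_LOCAL=%d",
                           cfg->num_wus_x, cfg->num_wus_y,
                           cfg->n, cfg->m, cfg->use_local);
    if (written < 0 || (size_t)written >= len) {
        return -1;  /* output was truncated or an encoding error occurred */
    }
    return written;
}
```

Printing the options string before each build keeps the variant being compiled visible, which helps with the debugging concern mentioned above.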
Sorry for the late response.