I am new to OpenCL.
I would like to write a generic kernel that I can later extend to other non-coalesced memory access patterns. For now I am pairing it with a rectangular stencil pattern
for simplicity (and to avoid out-of-bounds accesses).
The kernel manages local memory explicitly (__local float *lmem).
So far, I have structured my .cl file as below:
__kernel void kmain(
    __global float *in,
    __global float *out,
    __global float *in2,
    __local float *lmem)
{
    int wg_x = get_group_id(0);
    int wg_y = get_group_id(1);
    int wi_x = get_local_id(0);
    int wi_y = get_local_id(1);
    // number of work units each work-item processes
    for (int iter_x = 0; iter_x < NUM_WUS_X-1; iter_x++) {
        for (int iter_y = 0; iter_y < NUM_WUS_Y-1; iter_y++) {
            int wu_x, wu_y;
            // The current work unit coordinate (wu_x, wu_y) is computed from the work-group ID (wg_x, wg_y), the work-item ID (wi_x, wi_y) and the work unit ID (iter_x, iter_y) (pseudocode):
            (wu_x, wu_y) = func(wg_x, wg_y,
                                wi_x, wi_y,
                                iter_x, iter_y);
            // This is where to cooperatively load
            // a region of <in> to the local memory.
            // barrier (...);
            for (int i = 0; i < N-1; i++) {
                for (int j = 0; j < M-1; j++) {
                    // (fo, fi) determines the home access pattern centered around (idx_o, idx_i). WI(*,*) defines the memory access pattern, i.e. (wi_x) = (wi_y):
                    int idx_o = fo(wu_x, wu_y, i, j);
                    int idx_i = fi(wu_x, wu_y, i, j);
                    // the offsets CO_* and CI_* determine the stencil pattern within each work-item
                    ... = in[idx_o + CO_1][idx_i + CI_1];
                    ... // context (inner loop body)
                    ... = in[idx_o + CO_k][idx_i + CI_k];
                    ... // context (inner loop body)
                }
            }
            // barrier (...);
            ... // context (epilogue)
            out[y][x] = ...;
        }
    }
}
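To make the intended indexing concrete, here is a minimal CPU-side sketch in plain C that emulates a single work-group. It is only an illustration under assumptions that are not part of the kernel above: `TILE_W`, `TILE_H`, `HALO`, `clamp_index` and `emulate_work_group` are hypothetical names, the stencil is a simple 5-point sum, out-of-range halo reads are clamped to the edge, and `in`/`out` are treated as flat row-major arrays (a `float *` cannot be indexed as `in[y][x]`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { TILE_W = 4, TILE_H = 4, HALO = 1 };
enum { LW = TILE_W + 2 * HALO, LH = TILE_H + 2 * HALO };

/* Clamp an index into [0, n-1] so halo loads never leave the image. */
static int clamp_index(int v, int n) {
    if (v < 0) return 0;
    if (v >= n) return n - 1;
    return v;
}

/* Emulates work-group (wg_x, wg_y) on the CPU.
 * in and out are flat row-major arrays of size width * height. */
static void emulate_work_group(const float *in, float *out,
                               int width, int height,
                               int wg_x, int wg_y) {
    assert(in != NULL && out != NULL && width > 0 && height > 0);
    float lmem[LH * LW];               /* stands in for __local float *lmem */
    int base_x = wg_x * TILE_W - HALO; /* global coordinates of lmem[0] */
    int base_y = wg_y * TILE_H - HALO;

    /* Phase 1: cooperative load. Work-item wi copies elements
     * wi, wi + n_wi, wi + 2 * n_wi, ... of the tile-plus-halo region. */
    int n_wi = TILE_W * TILE_H;
    for (int wi = 0; wi < n_wi; wi++) {
        for (int k = wi; k < LW * LH; k += n_wi) {
            int gx = clamp_index(base_x + k % LW, width);
            int gy = clamp_index(base_y + k / LW, height);
            lmem[k] = in[gy * width + gx];
        }
    }
    /* In the kernel, barrier(CLK_LOCAL_MEM_FENCE) would go here. */

    /* Phase 2: each work-item reads its stencil from lmem only. */
    for (int wi_y = 0; wi_y < TILE_H; wi_y++) {
        for (int wi_x = 0; wi_x < TILE_W; wi_x++) {
            int gx = wg_x * TILE_W + wi_x;
            int gy = wg_y * TILE_H + wi_y;
            if (gx >= width || gy >= height) continue; /* padded work-items */
            int lx = wi_x + HALO, ly = wi_y + HALO;
            out[gy * width + gx] = lmem[ly * LW + lx]
                                 + lmem[ly * LW + lx - 1]
                                 + lmem[ly * LW + lx + 1]
                                 + lmem[(ly - 1) * LW + lx]
                                 + lmem[(ly + 1) * LW + lx];
        }
    }
}
```

Calling `emulate_work_group` once per (wg_x, wg_y) covers the whole image, which gives a reference result to compare the real kernel against. In the kernel, the outer `wi` loop disappears (each work-item runs only the inner strided loop), and the two phases are separated by the barrier.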
Does anyone have an idea how to implement this pattern together with a corresponding generic host program?
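For the host side, the part I can already sketch is the size computation. The following plain C helper is a rough sketch under assumptions: `launch_cfg`, `round_up` and `make_launch_cfg` are hypothetical names, each work-item is assumed to cover `num_wus_x * num_wus_y` outputs, and the local tile for one work unit is assumed to be one work-group-sized block plus a halo on every side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical launch description for a 2D kernel. */
typedef struct {
    size_t local[2];   /* work-group size */
    size_t global[2];  /* NDRange size, a multiple of local[] */
    size_t lmem_bytes; /* size of the __local buffer in bytes */
} launch_cfg;

/* Round n up to the next multiple of m. */
static size_t round_up(size_t n, size_t m) {
    assert(m > 0);
    return (n + m - 1) / m * m;
}

static launch_cfg make_launch_cfg(size_t width, size_t height,
                                  size_t wg_w, size_t wg_h,
                                  size_t num_wus_x, size_t num_wus_y,
                                  size_t halo) {
    assert(wg_w > 0 && wg_h > 0 && num_wus_x > 0 && num_wus_y > 0);
    launch_cfg c;
    /* Each work-item covers num_wus_x * num_wus_y outputs, so the grid
     * of work-items shrinks by that factor in each dimension. */
    size_t items_x = (width + num_wus_x - 1) / num_wus_x;
    size_t items_y = (height + num_wus_y - 1) / num_wus_y;
    c.local[0] = wg_w;
    c.local[1] = wg_h;
    c.global[0] = round_up(items_x, wg_w);
    c.global[1] = round_up(items_y, wg_h);
    /* Assumption: one tile is wg_w x wg_h elements plus the halo. */
    c.lmem_bytes = (wg_w + 2 * halo) * (wg_h + 2 * halo) * sizeof(float);
    return c;
}
```

With such a struct, `global` and `local` would be passed as `global_work_size` and `local_work_size` to `clEnqueueNDRangeKernel` (with `work_dim` set to 2), and the local buffer would be sized with `clSetKernelArg(kernel, 3, cfg.lmem_bytes, NULL)`, since a `__local` pointer argument takes a size and a NULL value. Constants such as `NUM_WUS_X` could be supplied as `-D NUM_WUS_X=...` in the options string of `clBuildProgram`. Because the global size is rounded up (OpenCL 1.x requires it to be a multiple of the local size), the kernel would also need a bounds check for the padded work-items.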