I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with local workgroup size 500 while its physical maximum is 100, and the kernel looks for example like this:
__kernel void test(float* input) {
i = get_global_id(0);
someCode(i);
barrier();
moreCode(i);
barrier();
finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(float* input) {
i = get_global_id(0);
someCode(5*i);
someCode(5*i+1);
someCode(5*i+2);
someCode(5*i+3);
someCode(5*i+4);
barrier();
moreCode(5*i);
moreCode(5*i+1);
moreCode(5*i+2);
moreCode(5*i+3);
moreCode(5*i+4);
barrier();
finalCode(5*i);
finalCode(5*i+1);
finalCode(5*i+2);
finalCode(5*i+3);
finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?