Question:

I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.

It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with a local work group size of 500 while the device's physical maximum is 100, and the kernel looks for example like this:

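__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(i);
    // barrier() takes a fence flag; the local fence is assumed here
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(i);
}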

then it could be converted automatically to an execution with work group size 100 on this kernel:

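__kernel void test(__global float* input) {
    int i = get_global_id(0);
    // each physical work item handles 5 of the original work items
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    // barrier() takes a fence flag; the local fence is assumed here
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}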
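
To make the proposed transformation concrete, here is a minimal sketch in plain C that simulates one work group on the host. It is only an illustration under assumptions: some_code, more_code and final_code are hypothetical stand-ins for someCode, moreCode and finalCode, and the sizes 500 and 100 are the ones from the example above. Each region between barriers becomes a loop over the 100 physical work items, and each of those runs 5 logical work items in turn.

```c
#include <assert.h>
#include <stddef.h>

#define LOGICAL_SIZE 500   /* requested work group size (from the example) */
#define PHYSICAL_SIZE 100  /* assumed device maximum (from the example) */
#define FACTOR (LOGICAL_SIZE / PHYSICAL_SIZE)

/* Hypothetical stand-ins for someCode / moreCode / finalCode. */
static void some_code(int *buf, int i) { buf[i] = i; }

/* Reads a neighbour's value, so it is only correct after every
   logical work item has finished some_code (hence the barrier). */
static void more_code(const int *buf, int *out, int i) {
    out[i] = buf[(i + 1) % LOGICAL_SIZE];
}

static void final_code(int *out, int i) { out[i] += 1; }

/* Simulates one work group: the outer loop plays the role of the
   physical work items, the inner loop the logical ones folded into it.
   The end of each outer loop corresponds to a barrier in the kernel. */
void run_coarsened_group(int *buf, int *out) {
    for (int p = 0; p < PHYSICAL_SIZE; ++p)
        for (int j = 0; j < FACTOR; ++j)
            some_code(buf, FACTOR * p + j);
    /* barrier */
    for (int p = 0; p < PHYSICAL_SIZE; ++p)
        for (int j = 0; j < FACTOR; ++j)
            more_code(buf, out, FACTOR * p + j);
    /* barrier */
    for (int p = 0; p < PHYSICAL_SIZE; ++p)
        for (int j = 0; j < FACTOR; ++j)
            final_code(out, FACTOR * p + j);
}
```

Calling run_coarsened_group with two int arrays of 500 elements gives the same result as running the three stages for all 500 logical work items. The sketch only covers barriers at the top level of the kernel, as in the example; it makes no claim about barriers inside loops or branches. Note also that any private variable that stays live across a barrier would need one copy per logical work item.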

However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?

Answer 1:

I think that the origin of the CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.

Multiple threads run simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family, there is a hardware limit on the number of stack entries that are available (every stack entry has subentries). This in essence limits the number of threads every compute unit can handle simultaneously.

As for whether the compiler could do this to make it possible: it could work, but understand that it would mean recompiling the kernel all over again, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software, just for "not so open-source" reasons.

Answer 2:

Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile-time (where compiling of course refers to compiling the kernel). I might be getting you wrong, but it seems you're thinking of setting those values by yourself, which wouldn't be the case.

It is your code's responsibility to query the system's capabilities, so that it is prepared for whatever hardware it will run on.
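
As a rough sketch of what that preparation can look like on the host side, the two helpers below pick launch sizes from a queried limit. They are hypothetical names, not part of the OpenCL API; the device_max argument is assumed to be the size_t value obtained from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clamp the preferred local size to the limit
   reported by the device (assumed to come from clGetDeviceInfo with
   CL_DEVICE_MAX_WORK_GROUP_SIZE). */
size_t choose_local_size(size_t preferred, size_t device_max) {
    assert(preferred > 0 && device_max > 0);
    return preferred < device_max ? preferred : device_max;
}

/* Hypothetical helper: round the problem size n up to the next
   multiple of the local size, since OpenCL 1.x requires the global
   size to be divisible by the local size. */
size_t padded_global_size(size_t n, size_t local) {
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}
```

With the numbers from the question, choose_local_size(500, 100) yields 100, and the two results would then be passed as the local and global work sizes to clEnqueueNDRangeKernel. If the global size is padded this way, the kernel has to skip work items whose get_global_id(0) is not below the real problem size.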