Why is the preferred work group size multiple part of kernel properties rather than device properties?

Posted 2019-07-19 05:01

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVIDIA this is the warp size; on AMD the term is wavefront).

Logically that would lead one to assume that the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value that isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and that should hold regardless of which kernel is being executed.
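For reference, the query looks roughly like this on the host side (a minimal sketch in plain C; the kernel and device handles are assumed to have been created earlier, and error handling is omitted):

    /* Minimal sketch: querying the preferred work-group size multiple
     * for an already-built kernel. The kernel and device arguments are
     * placeholders for objects created earlier. */
    #include <stdio.h>
    #include <CL/cl.h>

    void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t preferred_multiple = 0;

        /* Note that the query takes BOTH a kernel and a device: the value
         * may differ from one kernel to another on the same device. */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple),
                                 &preferred_multiple, NULL);

        printf("Preferred work-group size multiple: %zu\n", preferred_multiple);
    }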

My question is: why is this not the case? Surely this design decision wasn't completely arbitrary. Are there some underlying implementation limitations, or are there cases where this property really should be a kernel property?

Tags: opencl gpgpu
4 Answers
[account banned]
Answer 2 · 2019-07-19 05:46

After reading through section 6.7.2 of the OpenCL 1.2 specification, I found that a kernel is allowed to carry compiler attributes, via the __attribute__ keyword, that specify either a required or a recommended work-group size. These hints can only be reported back to the host if the preferred work group size multiple is a kernel property rather than a device property.
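As an illustration, those attributes look roughly like this in OpenCL C (the kernel bodies and the size of 64 are made-up examples, not taken from the spec text referenced above):

    // Hard requirement: the kernel may only be enqueued with this exact
    // local size; the host can read it back via CL_KERNEL_COMPILE_WORK_GROUP_SIZE.
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void scale_required(__global float *data, float factor)
    {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }

    // Soft hint: the compiler may use it when optimizing, e.g. when vectorizing.
    __kernel __attribute__((work_group_size_hint(64, 1, 1)))
    void scale_hinted(__global float *data, float factor)
    {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }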

The theoretical best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.

太酷不给撩
Answer 3 · 2019-07-19 05:47

The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.

Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages to do full automatic vectorization; similarly, for a kernel that uses float4s throughout, the compiler might still be able to find a way to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
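As a sketch of that scenario, consider two equivalent kernels, one scalar and one float4-based; on a 16-wide device a vectorizing compiler could in principle report a multiple of 16 for the first and 4 for the second, though the reported values are entirely implementation-dependent (the kernels below are illustrative, not taken from the answer):

    // Same work, different data widths per work-item. A vectorizing compiler
    // may report different CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
    // values for these two kernels on the same device.

    __kernel void add_scalar(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];          // one float per work-item
    }

    __kernel void add_vec4(__global const float4 *a,
                           __global const float4 *b,
                           __global float4 *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];          // four floats per work-item
    }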

In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).

成全新的幸福
Answer 4 · 2019-07-19 06:01

Logically, what you are saying is right, but you are only considering the data parallelism achieved by SIMD. The effective SIMD width also changes with the data type: it is one value for char and another for double. You are also forgetting that all work-items in a work-group share memory resources through local memory. The local memory size is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.
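A quick way to see both effects is to query the per-type preferred vector widths and the local memory size directly from the device; these are plain device properties, independent of any kernel (a minimal sketch in C, error handling omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    void print_device_limits(cl_device_id device)
    {
        cl_uint width_char = 0, width_double = 0;
        cl_ulong local_mem_bytes = 0;

        /* Preferred SIMD-style vector widths differ per data type. */
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                        sizeof(width_char), &width_char, NULL);
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                        sizeof(width_double), &width_double, NULL);

        /* Local memory shared by the work-items of a work-group. */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_bytes), &local_mem_bytes, NULL);

        printf("Preferred vector width (char):   %u\n", width_char);
        printf("Preferred vector width (double): %u\n", width_double);
        printf("Local memory size: %llu bytes\n",
               (unsigned long long)local_mem_bytes);
    }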

祖国的老花朵
Answer 5 · 2019-07-19 06:05

A GPU has many processors, each of which has a queue of tasks/jobs waiting to be calculated.

Tasks that are waiting to execute, either because they are blocked by a RAM access or because they have not yet been scheduled, are said to be 'in flight'.

To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the graphics card's RAM.
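As a rough back-of-the-envelope illustration of that idea (the latency and arithmetic figures below are made-up placeholders, not measurements from the answer):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical numbers for illustration only. */
        const double mem_latency_cycles = 400.0; /* assumed RAM access latency      */
        const double alu_cycles_between = 10.0;  /* assumed compute work between
                                                    consecutive memory accesses     */

        /* Enough tasks must be in flight that the processor always has
         * something ready to run while other tasks wait on memory. */
        double tasks_in_flight = mem_latency_cycles / alu_cycles_between;

        printf("Roughly %.0f tasks in flight are needed to hide the latency\n",
               tasks_in_flight);
        return 0;
    }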

References: Thread 1
