Why is the preferred work group size multiple part of kernel properties rather than device properties?

Posted 2019-07-19 05:01

From what I understand, the preferred work group size is roughly dependent on the SIMD width of a compute device (for NVIDIA this is the warp size; on AMD the term is wavefront).

Logically that would lead one to assume that the preferred work group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value that isn't a multiple of the underlying hardware's SIMD width would not fully load the hardware, resulting in reduced performance, and that should hold regardless of which kernel is being executed.
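For reference, the query looks roughly like this on the host side (a minimal sketch in plain C; the kernel and device handles are assumed to have been created earlier, and error handling is omitted):

    /* Minimal sketch: querying the preferred work-group size multiple
     * for an already-built kernel. The kernel and device arguments are
     * placeholders for objects created earlier. */
    #include <stdio.h>
    #include <CL/cl.h>

    void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t preferred_multiple = 0;

        /* Note that the query takes BOTH a kernel and a device: the value
         * may differ from one kernel to another on the same device. */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple),
                                 &preferred_multiple, NULL);

        printf("Preferred work-group size multiple: %zu\n", preferred_multiple);
    }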

My question is: why is this not the case? Surely this design decision wasn't completely arbitrary. Are there some underlying implementation limitations, or are there cases where this property really should be a kernel property?

Tags: opencl gpgpu
4 Answers
[account banned]
Answer 2 · 2019-07-19 05:46

After reading through section 6.7.2 of the OpenCL 1.2 specification, I found that a kernel is allowed to carry compiler attributes, via the __attribute__ keyword, that specify either a required or a recommended work-group size. These hints can only be reported back to the host if the preferred work group size multiple is a kernel property rather than a device property.
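As an illustration, those attributes look roughly like this in OpenCL C (the kernel bodies and the size of 64 are made-up examples, not taken from the spec text referenced above):

    // Hard requirement: the kernel may only be enqueued with this exact
    // local size; the host can read it back via CL_KERNEL_COMPILE_WORK_GROUP_SIZE.
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void scale_required(__global float *data, float factor)
    {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }

    // Soft hint: the compiler may use it when optimizing, e.g. when vectorizing.
    __kernel __attribute__((work_group_size_hint(64, 1, 1)))
    void scale_hinted(__global float *data, float factor)
    {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }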

The theoretical best work-group size choice may be a device-specific property, but it won't necessarily work best for a specific kernel, or at all. For example, what works best may be a multiple of 2*CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, or something else altogether.

太酷不给撩
Answer 3 · 2019-07-19 05:47

The preferred work-group size multiple (PWGSM) is a kernel, rather than device, property, to account for vectorization.

Let's say that the hardware has 16-wide SIMD units. Then a fully scalar kernel could have a PWGSM of 16, assuming the compiler manages to do full automatic vectorization; similarly, for a kernel that uses float4s throughout, the compiler might still be able to find a way to coalesce work-items in groups of 4, and recommend a PWGSM of 4.
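As a sketch of that scenario, consider two equivalent kernels, one scalar and one float4-based; on a 16-wide device a vectorizing compiler could in principle report a multiple of 16 for the first and 4 for the second, though the reported values are entirely implementation-dependent (the kernels below are illustrative, not taken from the answer):

    // Same work, different data widths per work-item. A vectorizing compiler
    // may report different CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
    // values for these two kernels on the same device.

    __kernel void add_scalar(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];          // one float per work-item
    }

    __kernel void add_vec4(__global const float4 *a,
                           __global const float4 *b,
                           __global float4 *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];          // four floats per work-item
    }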

In practice the only compilers that do automatic vectorization (that I know of) are Intel's proprietary ICD, and the open source pocl. Everything else always just returns 1 (if on CPU) or the wavefront/warp width (on GPU).

成全新的幸福
Answer 4 · 2019-07-19 06:01

Logically, what you are saying is right, but you are only considering the data parallelism achieved by SIMD. The effective SIMD width also changes with the data type: it is one value for char and another for double. You are also forgetting that all work-items in a work-group share memory resources through local memory. The local memory size is not necessarily a multiple of the SIMD capability of the underlying hardware, and the underlying hardware has multiple local memories.
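A quick way to see both effects is to query the per-type preferred vector widths and the local memory size directly from the device; these are plain device properties, independent of any kernel (a minimal sketch in C, error handling omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    void print_device_limits(cl_device_id device)
    {
        cl_uint width_char = 0, width_double = 0;
        cl_ulong local_mem_bytes = 0;

        /* Preferred SIMD-style vector widths differ per data type. */
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                        sizeof(width_char), &width_char, NULL);
        clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                        sizeof(width_double), &width_double, NULL);

        /* Local memory shared by the work-items of a work-group. */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_bytes), &local_mem_bytes, NULL);

        printf("Preferred vector width (char):   %u\n", width_char);
        printf("Preferred vector width (double): %u\n", width_double);
        printf("Local memory size: %llu bytes\n",
               (unsigned long long)local_mem_bytes);
    }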

祖国的老花朵
Answer 5 · 2019-07-19 06:05

A GPU has many processors, each of which has a queue of tasks/jobs waiting to be calculated.

Tasks that are waiting to execute, either because they are blocked by a RAM access or because they have not yet been scheduled, are said to be 'in flight'.

To answer your question, the number of tasks in flight must be high enough to compensate for the waiting delay introduced by accesses to the graphics card's RAM.
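As a rough back-of-the-envelope illustration of that idea (the latency and arithmetic figures below are made-up placeholders, not measurements from the answer):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical numbers for illustration only. */
        const double mem_latency_cycles = 400.0; /* assumed RAM access latency      */
        const double alu_cycles_between = 10.0;  /* assumed compute work between
                                                    consecutive memory accesses     */

        /* Enough tasks must be in flight that the processor always has
         * something ready to run while other tasks wait on memory. */
        double tasks_in_flight = mem_latency_cycles / alu_cycles_between;

        printf("Roughly %.0f tasks in flight are needed to hide the latency\n",
               tasks_in_flight);
        return 0;
    }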

References: Thread 1
