Is private memory slower than local memory?

2019-01-23 15:11发布

问题:

I was working on a kernel which had much global memory access per thread so I copied them to local memory which gave a speed up of 40%.

I wanted still more speed up so copied from local to private which degraded the performance

So is it correct that I think we must not use to much private memory which may degrade the performance?

回答1:

Ashwin's answer is in the right direction but a little misleading.

OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.

Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVidia GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.

The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.



回答2:

(I know this is an old question, but the answers given aren't very accurate, and I saw conflicting answers elsewhere during Google searches.)

According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):

Private memory is memory that is unique to an individual work-item. Local variables and nonpointer kernel arguments are private by default. In practice, these variables are usually mapped to registers, although private arrays and any spilled registers are usually mapped to an off-chip (i.e., long-latency) memory.

So, if you use a great deal of private memory, or use arrays in private memory, yes, it can be slower than local memory.



回答3:

In (GPU-like) OpenCL devices, the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache. The private memory for each thread is actually apportioned from off-chip global memory. This is far from from the PE and might have a latency of hundreds of clock cycles, thus degrading the read-write performance.



回答4:

James Beilby's answer is the right direction but is a little bit out of the path:

Depending on implementation, it could be faster or slower because opencl doesn't force providers to use on-chip or off-chip memories but AMD is very good at OpenCL on price/performance dimension so I'll give some numbers about it.

Private memory in AMD implementation, is fastest(smallest latency,highest bandwidth like 22 TB/s for a mainstream gpu).

Here in appendix-d:

http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

you can see register file, LDS, constant cache and global those are used for different name spaces when there is enough space for themselves. For example, register file has 22 TB/s and only about 300kB per compute unit. This has less latency and more bandwidth than LDS which is used for __local memory space. Total LDS size is even less than that (per compute unit).

If going from local to private doesnt do good, you should decrease local thread group size from 256 to 64 for example. SO more private registers availeable per thread.

So for this example AMD gpu, local memory is 15 times faster than global memory, private memory is 5 times faster than local memory. If it doesn't fit in private memory, it spills to global memory so only L1-L2 cache can help here. If data is not re-used much, no point of using private registers here. Just stream from global to global if only used once.

For some smartphone or a cpu, it could be very bad to use private registers because they could be mapped to something else.



标签: opencl