Yet Another CUDA Texture Memory Thread. (Why should texture memory be faster on Fermi?)

Posted 2019-05-11 18:35

There are quite a few Stack Overflow threads asking why a kernel using textures is not faster than one using global memory access. The answers and comments always seem a bit esoteric to me.

The NVIDIA white paper on the Fermi architecture states in black and white:

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).

So why on earth should one expect any speedup from using texture memory on Fermi devices, since every memory fetch (regardless of whether it is bound to a texture or not) goes through the same L2 cache? In fact, in most cases direct access to global memory should be faster, since it is also cached through L1, which a texture fetch is not. This is also reported in a few related questions here on Stack Overflow.
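For concreteness, here is a minimal sketch of the kind of kernel pair I am comparing (a toy example of my own, so take the details with a grain of salt; texture objects as used here need Kepler or newer, while on Fermi the equivalent would use the older texture reference API):

#include <cuda_runtime.h>

// Plain global-memory copy: loads are serviced by the L1/L2 hierarchy.
__global__ void copyGlobal(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Same copy, but reads go through the texture path instead.
__global__ void copyTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Describe the linear buffer as a texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    // Time these two (e.g. with cudaEvent timers or nvprof) to compare paths.
    copyGlobal<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    copyTexture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}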

Can someone confirm this or show me what I'm missing?

3 answers
兄弟一词,经得起流年.
#2 · 2019-05-11 19:06

I would not disregard the use of texture memory. See, e.g., the paper 'Communication-Minimizing 2D Convolution in GPU Registers' (http://parlab.eecs.berkeley.edu/publication/899), which compares different implementations of small 2D convolutions; according to the authors, loading from texture memory directly into registers is one of the better-performing strategies.
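As a rough sketch of that strategy (my illustration, not the authors' code), each thread fetches its 3x3 input neighborhood from a 2D texture straight into registers and accumulates there, with the filter assumed to live in constant memory:

// Hypothetical 3x3 convolution; border handling comes from the texture
// address mode (e.g. cudaAddressModeClamp) set when the texture was created.
__constant__ float c_filter[3][3];

__global__ void conv3x3(cudaTextureObject_t tex, float *out,
                        int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    // Each tex2D fetch lands directly in a register; the texture cache
    // exploits the 2D locality of neighboring threads' reads.
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += c_filter[dy + 1][dx + 1] *
                   tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);

    out[y * width + x] = acc;
}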

叼着烟拽天下
#3 · 2019-05-11 19:08

You are neglecting that each Streaming Multiprocessor also has a texture cache (see the figure below, illustrating a Fermi Streaming Multiprocessor).

[Figure: block diagram of a Fermi Streaming Multiprocessor]

The texture cache serves a different purpose than the L1/L2 caches: it is optimized for spatial data locality. Data locality applies whenever data belonging to semantically (not physically) neighboring points of a regular Cartesian 1D, 2D, or 3D grid must be accessed. To better explain this concept, consider the following figure illustrating the stencil involved in 2D or 3D finite-difference calculations.

[Figure: finite-difference stencil on a regular grid, with the center point in red and its neighbors in blue]

Calculating the finite difference at the red point involves accessing the data associated with the blue points. Now, these data are not physical neighbors of the red point, since they are not stored consecutively in global memory when the 2D or 3D array is flattened to 1D. However, they are semantic neighbors of the red point, and texture memory is good at caching exactly these values. On the other hand, the L1/L2 caches perform well when the same datum or its physical neighbors are frequently accessed.
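To make this concrete, here is a minimal sketch (an illustration, not benchmarked code) of the five-point stencil above read through a 2D texture; the neighbors in y are a whole row apart in the flattened array, yet close together as far as the texture cache is concerned:

// 2D Laplacian stencil via texture fetches (point filtering, unnormalized
// coordinates, and clamp addressing assumed).
__global__ void laplacian2D(cudaTextureObject_t tex, float *out,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float center = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    float north  = tex2D<float>(tex, x + 0.5f, y - 0.5f);  // semantically near,
    float south  = tex2D<float>(tex, x + 0.5f, y + 1.5f);  // but physically
    float west   = tex2D<float>(tex, x - 0.5f, y + 0.5f);  // ~width elements
    float east   = tex2D<float>(tex, x + 1.5f, y + 0.5f);  // apart in memory

    out[y * width + x] = north + south + east + west - 4.0f * center;
}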

The other side of the coin is that the texture cache has a higher latency than the L1/L2 caches, so in some cases not using texture may not lead to a significant worsening of performance, simply thanks to the L1/L2 caching mechanism. From this point of view, texture was of primary importance in the early CUDA architectures, where global memory reads were not cached. But, as demonstrated in Is 1D texture memory access faster than 1D global memory access?, texture memory is still worth using on Fermi.

(account banned)
#4 · 2019-05-11 19:19

If the data being read via texture is 2D or 3D, the block linear layout of CUDA arrays usually gives better reuse than pitch-linear layouts, because cache lines contain 2D or 3D blocks of data instead of rows.
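As a hedged sketch of that setup (names here are illustrative), the data can be staged into a CUDA array, whose opaque block-linear layout the texture hardware then exploits:

#include <cuda_runtime.h>

// Copies a width x height float image into a CUDA array (block-linear
// layout) and wraps it in a texture object. The array must outlive the
// returned texture object.
cudaTextureObject_t makeBlockLinearTexture(const float *h_data,
                                           int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, h_data, width * sizeof(float),
                        width * sizeof(float), height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}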

But even for 1D data, it's possible for the texture cache to complement the other on-chip cache resources. If the kernel is only using global memory accesses with no texture loads, all of that memory traffic goes through the per-SM L1 cache. If some of the kernel's input data gets read through texture, the per-SM texture cache will relieve some pressure from the L1 and enable it to service memory traffic that would otherwise go to the L2.
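A tiny hypothetical kernel to illustrate the split: one input streams through ordinary global loads (serviced by the L1), the other through the texture path (serviced by the per-SM texture cache), so the two inputs do not compete for the same cache:

__global__ void scaleByTexture(cudaTextureObject_t texX, const float *a,
                               float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a[i]                        // L1/L2 path
             * tex1Dfetch<float>(texX, i); // texture cache path
}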

When making these tradeoffs, it's important to pay attention to the decisions NVIDIA has made from one chip architecture to the next. The texture caches in Maxwell are shared with the L1, which makes reading from texture less desirable.
