Efficiently Initializing Shared Memory Array in CU

Posted 2019-09-10 12:59

Note that this shared memory array is never written to, only read from.

As I have it, my shared memory gets initialized like:

__shared__ float TMshared[2592];
for (int i = 0; i < 2592; i++)
{
    TMshared[i] = TM[i];
}
__syncthreads();

(TM is passed to the kernel as an argument at launch, so every thread sees the same pointer.)

You might have noticed that this is highly inefficient as there is no parallelization going on and threads within the same block are writing to the same location.

Can someone please recommend a more efficient approach, or comment on whether this really needs optimization, given that the shared array in question is relatively small?

Thanks!

Tags: c++ memory cuda shared

1 answer

beautiful°

#2 · 2019-09-10 13:24

Use all threads in the block to write to independent locations; it will probably be quicker.

Example assumes 1D threadblock/grid:

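#define SSIZE 2592

__shared__ float TMshared[SSIZE];

// Each thread copies elements lidx, lidx + blockDim.x, lidx + 2*blockDim.x, ...
int lidx = threadIdx.x;
while (lidx < SSIZE) {
    TMshared[lidx] = TM[lidx];
    lidx += blockDim.x;
}

// Wait until the whole array is loaded before any thread reads it
__syncthreads();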
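For anyone who wants to try this end to end, here is a minimal sketch of the same block-stride load inside a complete program. The kernel name `useTable`, the `out` buffer, the launch configuration, and the host-side check are all hypothetical scaffolding added for illustration; only `TM`, `TMshared`, and `SSIZE` come from the snippets above. It assumes a 1D block and a 1D grid, as before.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define SSIZE 2592  // table size taken from the question

// Hypothetical kernel: loads TM into shared memory cooperatively, then has
// each thread read one element back so the result can be checked on the host.
__global__ void useTable(const float *TM, float *out)
{
    __shared__ float TMshared[SSIZE];

    // Block-stride loop: thread t copies elements t, t + blockDim.x, ...
    // Consecutive threads touch consecutive addresses, so global loads coalesce.
    for (int i = threadIdx.x; i < SSIZE; i += blockDim.x)
        TMshared[i] = TM[i];

    // Every write must be visible before any thread reads the table
    __syncthreads();

    // Placeholder work (assumption): read one entry selected by the global index
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = TMshared[gid % SSIZE];
}

int main()
{
    const int threads = 256, blocks = 4, n = threads * blocks;

    // Fill the host table with values that are exactly representable as floats
    std::vector<float> hTM(SSIZE), hOut(n);
    for (int i = 0; i < SSIZE; ++i) hTM[i] = 0.5f * i;

    float *dTM = nullptr, *dOut = nullptr;
    cudaMalloc(&dTM, SSIZE * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dTM, hTM.data(), SSIZE * sizeof(float), cudaMemcpyHostToDevice);

    useTable<<<blocks, threads>>>(dTM, dOut);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare what each thread read from shared memory with the host table
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hTM[i % SSIZE]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dTM);
    cudaFree(dOut);
    return mismatches != 0;
}
```

Because the loop strides by `blockDim.x`, it works for any block size, including one that does not divide 2592 evenly. Note that every block still loads its own copy of the table, since shared memory is per block.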
