I am using a Tesla C2050, which has compute capability 2.0 and 48KB of shared memory. But when I try to use this shared memory, the nvcc compiler gives me the following error:

    Entry function '_Z4SAT3PhPdii' uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max)
My SAT1 is a naive implementation of the scan algorithm, and because I am operating on image sizes of the order of 4096x2160, I have to use double to calculate the cumulative sum. Although the Tesla C2050 does not support double, it nevertheless does the task by demoting it to float. But for an image width of 4096, the shared memory size comes out to be greater than 16KB, though it is well within the 48KB limit.
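For reference on where the 0x8020 bytes come from: a per-row buffer of 4096 doubles is already 32768 bytes (0x8000). A stripped-down sketch of this kind of kernel is below; the SAT3 signature is reconstructed from the mangled name in the error message, and the body is only illustrative, not the actual code.

    // Illustrative sketch only: one block scans one image row.
    // 4096 doubles = 32768 bytes (0x8000) of shared memory, which exceeds
    // the 16KB (0x4000) limit assumed when compiling for sm_1x targets.
    #define ROW_WIDTH 4096

    __global__ void SAT3(unsigned char *in, double *out, int width, int height)
    {
        __shared__ double row[ROW_WIDTH];          // 0x8000 bytes; assumes width <= ROW_WIDTH

        int r = blockIdx.x;                        // one block per image row

        // Load the row into shared memory.
        for (int c = threadIdx.x; c < width; c += blockDim.x)
            row[c] = (double)in[r * width + c];
        __syncthreads();

        // Naive serial inclusive scan by thread 0 -- not efficient, just
        // enough to show the shared memory footprint.
        if (threadIdx.x == 0)
            for (int c = 1; c < width; ++c)
                row[c] += row[c - 1];
        __syncthreads();

        // Write the cumulative sums back to global memory.
        for (int c = threadIdx.x; c < width; c += blockDim.x)
            out[r * width + c] = row[c];
    }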
Can anybody help me understand what is happening here? I am using CUDA toolkit 3.0.
By default, Fermi cards run in a compatibility mode, with 16KB shared memory and 48KB L1 cache per multiprocessor. The API call cudaThreadSetCacheConfig can be used to change the GPU to run with 48KB shared memory and 16KB L1 cache, if you require it. You then must compile the code for compute capability 2.0 to avoid the code generation error you are seeing.
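For example, called once before any kernel launches (a minimal sketch with simple error handling; whether cudaThreadSetCacheConfig is available depends on your toolkit version, and on newer toolkits the equivalent call is cudaDeviceSetCacheConfig):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        // Ask the runtime to prefer 48KB shared memory / 16KB L1 per
        // multiprocessor on Fermi for subsequent kernel launches.
        cudaError_t err = cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaThreadSetCacheConfig failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }

        // ... launch the SAT kernels here ...
        return 0;
    }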
Also, your Tesla C2050 does support double precision. If you are getting compiler warnings about demoting doubles, it means you are not compiling your code for the correct architecture. Add -arch=sm_20 to your nvcc arguments and the GPU toolchain will compile for your Fermi card, including double precision support and other Fermi-specific hardware features such as the larger shared memory size.

As far as I know, CUDA 3.0 supports compute capability 2.0. I use VS 2010 with CUDA 4.1, so I assume VS 2008 should be somewhat similar: right-click on the project and select Properties -> CUDA C/C++ -> Device -> Code Generation, and change it to compute_10,sm_10;compute_20,sm_20.
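For a command-line build, the equivalent of that Visual Studio setting would look roughly like this (file names are placeholders):

    # Target Fermi (compute capability 2.0) only:
    nvcc -arch=sm_20 -o sat sat.cu

    # Or build a fat binary for both sm_1x and sm_2x devices, matching the
    # compute_10,sm_10;compute_20,sm_20 setting above:
    nvcc -gencode arch=compute_10,code=sm_10 \
         -gencode arch=compute_20,code=sm_20 \
         -o sat sat.cu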