When running CUDA kernels, I noticed that many of my own kernels fail in the x64 build, whereas the Win32 build works.
I am very confused because the CUDA source code is the same and both builds compile fine. Only when the x64 build executes does it report that the launch requests too many resources. But shouldn't x64 conceptually allow more resources than Win32?
I normally like to use 1024 threads per block where possible. To make the x64 code work, I have to reduce the block size to 256.
Does anyone have any idea?
Yes, it's possible. Presumably the issue you are talking about is a registers-per-thread issue.
In 32-bit mode, all pointers are 32 bits wide and require only one 32-bit register for storage on the GPU. With the exact same source code compiled in 64-bit mode, those pointers are 64 bits wide and therefore effectively require two 32-bit registers each (and, as @njuffa points out below, certain other types, size_t for example, can change their size as well, requiring double the registers). The number of available 32-bit registers is a hardware limit that does not change whether compiling for 32-bit or 64-bit mode, but pointer storage uses twice as many registers in 64-bit mode.
Pointer arithmetic (or arithmetic involving any of the types that increase in size) may likewise be impacted, as some of it may need to be done using 64-bit arithmetic vs. 32-bit arithmetic.
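To check whether registers are really the limiting factor, you can ask the runtime how many registers a kernel uses and compare that with the device's per-block limit. The following is a minimal sketch, not a definitive diagnostic: myKernel is a hypothetical stand-in for one of your own kernels, fitsInBlock is a helper I made up for illustration, and the comparison ignores the hardware's register allocation granularity. Building it once as Win32 and once as x64 should show the difference in registers per thread. Compiling with nvcc -Xptxas -v reports the same per-kernel register count at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one of your own; note the pointer
// arguments and the size_t index, which are wider in a 64-bit build.
__global__ void myKernel(const float *a, const float *b, float *out, size_t n)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Illustrative helper (an assumption, not a CUDA API): true if a block of
// threadsPerBlock threads fits in the per-block register budget.
// Simplification: ignores register allocation granularity.
static bool fitsInBlock(int regsPerThread, int threadsPerBlock, int regsPerBlock)
{
    return static_cast<long long>(regsPerThread) * threadsPerBlock <= regsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int threads = 1024;  // the block size from the question
    printf("pointer size in this build:        %d bytes\n", (int)sizeof(void *));
    printf("registers per thread (kernel):     %d\n", attr.numRegs);
    printf("registers per block (device):      %d\n", prop.regsPerBlock);
    printf("max threads per block (kernel):    %d\n", attr.maxThreadsPerBlock);
    printf("%d threads per block: %s\n", threads,
           fitsInBlock(attr.numRegs, threads, prop.regsPerBlock)
               ? "fits the register budget"
               : "exceeds the register budget");
    return 0;
}
```

If attr.maxThreadsPerBlock comes back below 1024 in the x64 build but not in the Win32 build, that would be consistent with the register explanation above.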
If these registers-per-thread increases in 64-bit mode place your overall usage over the limit, then you will have to use one of a variety of methods to manage it. You've mentioned one already: reduce the number of threads. You can also investigate the
nvcc -maxrregcount ...
switch, and/or the launch bounds directive (__launch_bounds__).
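As a rough illustration of the launch bounds approach, the sketch below tells the compiler that the kernel will be launched with up to 1024 threads per block, so it should limit register use accordingly. scaleKernel, blocksFor and MAX_THREADS_PER_BLOCK are hypothetical names chosen for this example. The command line shown in the comment is the file-wide alternative. Its value of 64 is only an assumption based on a device with 65536 registers per block (65536 / 1024 = 64), so check your own device's limit first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant: the largest block size this kernel is launched with.
#define MAX_THREADS_PER_BLOCK 1024

// __launch_bounds__ asks the compiler to keep register usage low enough
// that a block of MAX_THREADS_PER_BLOCK threads can launch.
// File-wide alternative (value 64 is an assumption, see the text above):
//   nvcc -maxrregcount=64 kernel.cu
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK)
scaleKernel(float *data, float factor, size_t n)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Illustrative helper: number of blocks needed to cover n elements.
static unsigned int blocksFor(size_t n, unsigned int threadsPerBlock)
{
    return static_cast<unsigned int>((n + threadsPerBlock - 1) / threadsPerBlock);
}

int main()
{
    const size_t n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    scaleKernel<<<blocksFor(n, MAX_THREADS_PER_BLOCK), MAX_THREADS_PER_BLOCK>>>(d, 2.0f, n);

    // A "too many resources requested for launch" failure is reported here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("launch result: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Both methods can force the compiler to spill registers to local memory, which may cost performance, so it is worth timing the 1024-thread version against the 256-thread one.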