I have implemented a string matching algorithm on the GPU. The search time of the parallel version is considerably lower than that of the sequential version, but I get different results depending on the number of blocks and threads I use. How can I determine the number of blocks and threads that gives the best results?
I think this question is hard, if not impossible, to answer, because it really depends on the algorithm and how it operates. Since I can't see your implementation, I can only give you some leads:
Don't use global memory if you can avoid it, and check how you can max out the use of shared memory. In general, get a good feel for how your threads access memory and how data is retrieved.
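A minimal sketch of that idea for string matching (not your implementation; `CHUNK` and `PATTERN_MAX` are made-up constants, `patLen` is assumed to be at most `PATTERN_MAX`, and `hit` is assumed to be zero-initialised): each block copies its chunk of the text, plus a small overlap, into shared memory, so the inner comparison loop never touches global memory:

```cuda
#define CHUNK 256
#define PATTERN_MAX 32

__global__ void matchKernel(const char *text, int textLen,
                            const char *pattern, int patLen,
                            unsigned char *hit)
{
    __shared__ char tile[CHUNK + PATTERN_MAX];

    int base = blockIdx.x * CHUNK;

    // Cooperative load of the chunk plus an overlap of patLen - 1 characters,
    // so matches crossing the chunk border are not missed.
    for (int i = threadIdx.x; i < CHUNK + patLen - 1; i += blockDim.x) {
        int g = base + i;
        tile[i] = (g < textLen) ? text[g] : '\0';
    }
    __syncthreads();

    // Each thread checks one or more candidate start positions in the tile.
    for (int s = threadIdx.x; s < CHUNK; s += blockDim.x) {
        if (base + s + patLen > textLen) break;
        bool match = true;
        for (int j = 0; j < patLen && match; ++j)
            match = (tile[s + j] == pattern[j]);
        hit[base + s] = match ? 1 : 0;
    }
}
```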
Understand how your warps operate. Sometimes threads in a warp wait for other threads to finish if you have a 1-to-1 mapping between threads and data. Instead of that 1-to-1 mapping, you can map each thread to multiple data elements so that the threads are kept busy.
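The standard pattern for the "one thread, many elements" mapping is a grid-stride loop. A minimal sketch (the kernel body, doubling values, is just a placeholder):

```cuda
__global__ void process(float *data, int n)
{
    // Each thread handles every (gridDim.x * blockDim.x)-th element, so the
    // same launch configuration stays busy for any input size.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;   // replace with your per-element work
}
```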
Since blocks consist of threads grouped into warps of 32 threads, it is best if the number of threads in a block is a multiple of 32, so that you don't end up with a warp containing only 3 active threads, etc.
Avoid diverging paths within warps.
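To illustrate what divergence means here, a small sketch: in the first kernel, threads of the same warp take different branches (odd vs. even thread index), so the warp executes both paths serially; in the second, the branch granularity is a whole warp of 32 threads, so every warp takes a single path:

```cuda
__global__ void divergent(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) a[i] += 1.0f;   // half of each warp
    else                      a[i] -= 1.0f;   // the other half
}

__global__ void uniform(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) a[i] += 1.0f;  // whole warps branch together
    else                             a[i] -= 1.0f;
}
```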
I hope this helps a bit.
@Chris's points are very important too, but they depend more on the algorithm itself.
Check the CUDA manual about thread alignment regarding memory lookups. Shared memory arrays should also be sized as a multiple of 16.
Use coalesced global memory reads. Often the algorithm design gives you this anyway, and using shared memory helps.
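A sketch of the difference: in the first kernel, consecutive threads read consecutive addresses, so a warp's 32 loads combine into a few memory transactions; in the second, each thread strides by `STRIDE` elements (an illustrative constant), scattering the warp's addresses and multiplying the number of transactions:

```cuda
#define STRIDE 32

__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // neighbours read neighbours
}

__global__ void strided(const float *in, float *out, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * STRIDE;
    if (i < n) out[i] = in[i];            // warp addresses far apart
}
```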
Don't use atomic operations in global memory, or at all if possible; they are very slow. Some algorithms that use atomic operations can be rewritten with different techniques.
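One common rewrite, sketched with a hypothetical `sumNaive`/`sumReduced` pair: instead of one global `atomicAdd` per element, each block folds its partial sums in shared memory and writes a single per-block result (the partials still need one final pass on the host or in a second kernel; `BLOCK` must match `blockDim.x` and be a power of two):

```cuda
#define BLOCK 256

__global__ void sumNaive(const int *in, int n, int *total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);          // one global atomic per element
}

__global__ void sumReduced(const int *in, int n, int *partials)
{
    __shared__ int s[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int half = BLOCK / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            s[threadIdx.x] += s[threadIdx.x + half];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partials[blockIdx.x] = s[0];             // one global write per block
}
```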
Without seeing your code, no one can tell you what is best or why the performance changes.
The number of threads per block of your kernel is the most important value.
Important values for calculating it are:
- the maximum number of resident threads per multiprocessor (SM)
- the maximum number of resident blocks per SM
- the maximum number of threads per block
- the number of 32-bit registers per SM
Your algorithms should be scalable across all GPUs, reaching 100% occupancy. For this I created a helper class for myself which automatically detects the best thread count for the GPU in use and passes it to the kernel as a DEFINE.
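That helper class isn't shown here; as a minimal sketch of the idea (with a hypothetical `bestThreadsPerBlock()` function), one can query the device properties and emit the result as a compiler define, e.g. `-DTHREADS_PER_BLOCK=...`:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int bestThreadsPerBlock(int device)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, device);

    // Pick the largest block size (a multiple of the warp size) that divides
    // the SM's resident-thread limit evenly, so full occupancy is reachable.
    for (int t = p.maxThreadsPerBlock; t >= p.warpSize; t -= p.warpSize)
        if (p.maxThreadsPerMultiProcessor % t == 0)
            return t;
    return p.warpSize;
}

int main()
{
    printf("-DTHREADS_PER_BLOCK=%d\n", bestThreadsPerBlock(0));
    return 0;
}
```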
Example: on CUDA 1.1 hardware, to reach the maximum of 768 resident threads per SM you can run, for instance, 2 blocks of 384 threads, 3 blocks of 256 threads, 4 blocks of 192 threads, 6 blocks of 128 threads, or 8 blocks of 96 threads.
Good programming advice: keep your kernel's register usage low, since the SM's register file is shared by all resident threads.
Example: using CUDA 1.1 and the optimal number of 768 resident threads per SM, you have 8192 registers to use. This leads to 8192 / 768 = 10 registers maximum per thread/kernel. If you use 11, the GPU keeps one block less resident per SM, resulting in decreased performance.
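One way to enforce such a limit, as a sketch: `__launch_bounds__` tells the compiler the maximum block size (and optionally the minimum number of resident blocks per SM you want), so it caps registers per thread accordingly; alternatively, nvcc's `--maxrregcount=N` flag caps the register count for a whole compilation unit. The kernel body here is a placeholder:

```cuda
__global__ void __launch_bounds__(256, 3)   // <= 256 threads/block, >= 3 blocks/SM
myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}
```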
Example: a row-vector normalizing kernel of mine that works independently of the matrix dimensions (a sketch of the general idea follows).
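The original kernel isn't shown above; purely as an illustration, a minimal sketch of such a kernel might use one block per row, with strided loops so the row length is independent of the block size (`BLOCK` is an assumed power-of-two block size equal to `blockDim.x`; the grid has one block per row):

```cuda
#define BLOCK 256

__global__ void normalizeRows(float *m, int rows, int cols)
{
    __shared__ float s[BLOCK];
    float *row = m + blockIdx.x * cols;

    // Each thread accumulates a partial sum of squares over the row.
    float acc = 0.0f;
    for (int j = threadIdx.x; j < cols; j += blockDim.x)
        acc += row[j] * row[j];
    s[threadIdx.x] = acc;
    __syncthreads();

    // Block-level tree reduction to the full sum of squares.
    for (int half = BLOCK / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            s[threadIdx.x] += s[threadIdx.x + half];
        __syncthreads();
    }

    // Scale the row by the inverse Euclidean norm.
    float inv = rsqrtf(s[0]);
    for (int j = threadIdx.x; j < cols; j += blockDim.x)
        row[j] *= inv;
}
```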