Parallel Bubble sort on GPU

Published 2019-07-20 12:15

I am implementing the simple bubble sort algorithm in CUDA, and I have a question.
I use the following code to swap two consecutive elements in the array:

if(a[threadIdx.x]>a[threadIdx.x + 1])
    Swap(a[threadIdx.x] , a[threadIdx.x + 1]);

Note that the number of threads in the block is half the size of the array. Is this a good implementation? Would threads in a single warp execute in parallel even if there is a branch, so that it would actually take N iterations to sort the array?

Also note that I know there are better sorting algorithms I could implement, and that I could use Thrust, CUDPP, or a sample sorting algorithm from the SDK, but in my case I just need a simple algorithm to implement.

2 answers
何必那么认真
Answered 2019-07-20 13:09

I'm glad you realise that bubble sort on the GPU is likely to perform terribly! I'm struggling to see how you could get sufficient parallelism without having to launch many kernels. You may also struggle to work out when you're done.

Anyway, to answer your specific question: yes, it's highly likely that you will have warp divergence in this case. However, given that the "else" branch is effectively empty, this will not slow you down. On average (until the list is sorted), roughly half the threads in a warp will take the "if" branch while the other threads wait; once the "if" branch is complete, the warp threads go back to executing in step. This is far from your biggest problem :)

Root(大扎)
Answered 2019-07-20 13:15

I assume:

  1. The array you want to sort is small (<100 elements)
  2. It is a part of some bigger GPU algorithm
  3. The array resides in shared memory space, or can be copied there

If any of these is not true, don't do bubble sort!

the number of threads in the block is half the size of the array. Is this a good implementation?

It is reasonable. When a divergent branch occurs in a warp, all threads step through every branch in lockstep; the threads that did not take a branch simply have their execution flag disabled for it. This way, each branch is executed only once. The only exception is when no thread in the warp takes a branch, in which case that branch is skipped entirely.
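As an illustration (a hypothetical snippet, not from the question), consider a branch that splits a warp down the middle:

```cuda
// Both sides of the branch are issued by the whole warp, one after the
// other; lanes on the untaken side are simply masked off for that side.
if (threadIdx.x % 2 == 0)
    x += 1;   // odd-numbered lanes are masked while this executes
else
    x -= 1;   // even-numbered lanes are masked while this executes
// After the branch, all lanes of the warp reconverge and run in step again.
```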

BUG!

In your code, however, I see a problem. If you want one thread to operate on two elements of the array, make it own them exclusively, that is:

if(a[2*threadIdx.x]>a[2*threadIdx.x + 1])
    Swap(a[2*threadIdx.x] , a[2*threadIdx.x + 1]);

Otherwise, if Swap is executed by two neighbouring threads on an overlapping pair, some values may disappear and others may get duplicated in the array.
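Putting the fix together: a parallel bubble sort is normally written as odd-even transposition sort, alternating between pairs that start at even and at odd offsets. A minimal single-block sketch (the kernel name and in-place swap are mine, not from the question; it assumes n fits in shared memory and n <= 2 * blockDim.x):

```cuda
// Sketch of odd-even transposition sort for one block; illustrative only.
__global__ void oddEvenSort(int *data, int n)
{
    extern __shared__ int a[];

    // Stage the array in shared memory, two elements per thread.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        a[i] = data[i];
    __syncthreads();

    // n phases are enough to guarantee the array is sorted.
    for (int phase = 0; phase < n; ++phase) {
        // Even phases compare (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        int i = 2 * threadIdx.x + (phase & 1);
        if (i + 1 < n && a[i] > a[i + 1]) {
            int tmp = a[i];      // each thread owns its pair exclusively,
            a[i] = a[i + 1];     // so no two threads touch the same element
            a[i + 1] = tmp;
        }
        __syncthreads();         // make swaps visible before the next phase
    }

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] = a[i];
}
```

It would be launched with the shared-memory size passed explicitly, e.g. oddEvenSort<<<1, n / 2, n * sizeof(int)>>>(d_data, n).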

Another bug!

If your block is bigger than the warp size, remember to put __syncthreads() where needed. Even if your block is smaller (it shouldn't be), you should use __threadfence_block() to ensure that writes to shared memory are visible to the other threads of the same block. Otherwise the compiler may be too aggressive with its optimisations and make your code invalid.

Another problem

If you fix the first bug, you will have a 2-way bank conflict on your shared memory. It is not very important, but you might want to reorganise the data in your array to avoid it, e.g. store the elements in the following order:

[1, 3, 5, 7, 9, ..., 29, 31, 2, 4, 6, 8, ... , 30, 32]

This way, each thread's pair of elements (e.g. elements 1 and 2) lands in the same shared-memory bank, while consecutive threads access consecutive banks.
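With that layout, the comparison step could be indexed as follows (a hypothetical sketch of only the even-offset phase, assuming the reordered array is in shared memory and n = 2 * blockDim.x):

```cuda
// Thread i compares the pair that was originally (2i, 2i+1); in the
// reordered layout those sit at positions i and i + blockDim.x.
int lo = threadIdx.x;               // first half holds the "left" elements
int hi = threadIdx.x + blockDim.x;  // second half holds the "right" elements
if (a[lo] > a[hi]) {
    int tmp = a[lo];
    a[lo] = a[hi];
    a[hi] = tmp;
}
// Consecutive threads now read consecutive addresses in each half,
// so each thread's access hits a different shared-memory bank.
```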
