cub BlockRadixSort: how to deal with large tile si

When using cub::BlockRadixSort to do the sorting within a block, if the number of elements is too large, how do we deal with that? If we set a tile size to be too large, the shared memory for the temporary storage will soon not able to hold it. If we split it into multiple tiles, how do we post-process it after we sorted each tile?

Caveat: I am not a cub expert (far from it).
You might want to review this question/answer as I'm building on some of the work I did there.
Certainly if the problem size is large enough, then a device-wide sort would seem to be something you might want to consider. But your question seems focused on block sorting.

From my testing, cub doesn't really have requirements around where your original data is located, or where you place the temp storage. Therefore, one possible solution would be simply to place your temp storage in global memory. To analyze this, I created a code that has 3 different test cases:

Test a version of cub block sort with the temp storage in global memory.
Test the original version of cub block sort adapted from the example here
Test a version of cub block sort derived from my previous answer, where there is no copying of data to/from global memory, ie. it is assumed that the data is already resident "on-chip" i.e. in shared memory.

None of this is extensively tested, but since I am building on cub building blocks, and testing my results in the first two cases, hopefully I have not made any grievous errors. Here's the full test code, and I will make additional comments below:

$ cat t10.cu
#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#define nTPB 512
#define ELEMS_PER_THREAD 2
#define RANGE (nTPB*ELEMS_PER_THREAD)
#define DSIZE (nTPB*ELEMS_PER_THREAD)



#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

using namespace cub;
// GLOBAL CUB BLOCK SORT KERNEL
// Specialize BlockRadixSort collective types
typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
__device__ int my_val[DSIZE];
__device__ typename my_block_sort::TempStorage sort_temp_stg;

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER THREAD integers)
__global__ void global_BlockSortKernel()
{
    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));

}

// ORIGINAL CUB BLOCK SORT KERNEL
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
// Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
  typedef cub::BlockLoad<int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE> BlockLoadT;
  typedef cub::BlockStore<int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_TRANSPOSE> BlockStoreT;
  typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Allocate type-safe, repurposable shared memory for collectives
  __shared__ union {
    typename BlockLoadT::TempStorage load;
    typename BlockStoreT::TempStorage store;
    typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
// Obtain this block's segment of consecutive keys (blocked across threads)
  int thread_keys[ITEMS_PER_THREAD];
  int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
  BlockLoadT(temp_storage.load).Load(d_in + block_offset, thread_keys);
  __syncthreads(); // Barrier for smem reuse
// Collectively sort the keys
  BlockRadixSortT(temp_storage.sort).Sort(thread_keys);
  __syncthreads(); // Barrier for smem reuse
// Store the sorted segment
  BlockStoreT(temp_storage.store).Store(d_out + block_offset, thread_keys);
}



// SHARED MEM CUB BLOCK SORT KERNEL
// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER THREAD integers)
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void shared_BlockSortKernel(int *d_out)
{
    __shared__ int my_val[BLOCK_THREADS*ITEMS_PER_THREAD];
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ITEMS_PER_THREAD]  = (threadIdx.x + 5); // synth data
    my_val[threadIdx.x*ITEMS_PER_THREAD+1]  = (threadIdx.x + BLOCK_THREADS + 5); // synth data
    __syncthreads();
//    printf("thread %d data = %d\n", threadIdx.x,  my_val[threadIdx.x*ITEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ITEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ITEMS_PER_THREAD))));
    __syncthreads();

//    printf("thread %d sorted data = %d\n", threadIdx.x,  my_val[threadIdx.x*ITEMS_PER_THREAD]);
    if (threadIdx.x == clock()){ // dummy to prevent compiler optimization
      d_out[threadIdx.x*ITEMS_PER_THREAD] = my_val[threadIdx.x*ITEMS_PER_THREAD];
      d_out[threadIdx.x*ITEMS_PER_THREAD+1] = my_val[threadIdx.x*ITEMS_PER_THREAD+1];}
}


int main(){
    int *h_data, *h_result;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    h_data=(int *)malloc(DSIZE*sizeof(int));
    h_result=(int *)malloc(DSIZE*sizeof(int));
    if (h_data == 0) {printf("malloc fail\n"); return 1;}
    if (h_result == 0) {printf("malloc fail\n"); return 1;}
    for (int i = 0 ; i < DSIZE; i++) h_data[i] = rand()%RANGE;
    // first test sorting directly out of global memory
    global_BlockSortKernel<<<1,nTPB>>>(); //warm up run
    cudaDeviceSynchronize();
    cudaMemcpyToSymbol(my_val, h_data, DSIZE*sizeof(int));
    cudaCheckErrors("memcpy to symbol fail");
    cudaEventRecord(start);
    global_BlockSortKernel<<<1,nTPB>>>(); //timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 1 fail");
    cudaEventSynchronize(stop);
    float et;
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpyFromSymbol(h_result, my_val, DSIZE*sizeof(int));
    cudaCheckErrors("memcpy from symbol fail");
    if(!thrust::is_sorted(h_result, h_result+DSIZE)) { printf("sort 1 fail!\n"); return 1;}
    printf("global Elapsed time: %fms\n", et);
    printf("global Kkeys/s: %d\n", (int)(DSIZE/et));
    // now test original CUB block sort copying global to shared
    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, DSIZE*sizeof(int));
    cudaMalloc((void **)&d_out, DSIZE*sizeof(int));
    cudaCheckErrors("cudaMalloc fail");
    BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_in, d_out); // warm up run
    cudaMemcpy(d_in, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(start);
    BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_in, d_out); // timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 2 fail");
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    cudaMemcpy(h_result, d_out, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMemcpy D to H fail");
    if(!thrust::is_sorted(h_result, h_result+DSIZE)) { printf("sort 2 fail!\n"); return 1;}
    printf("CUB Elapsed time: %fms\n", et);
    printf("CUB Kkeys/s: %d\n", (int)(DSIZE/et));
    // now test shared memory-only version of block sort
    shared_BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_out); // warm-up run
    cudaEventRecord(start);
    shared_BlockSortKernel<nTPB, ELEMS_PER_THREAD><<<1, nTPB>>>(d_out); // timing run
    cudaEventRecord(stop);
    cudaDeviceSynchronize();
    cudaCheckErrors("cub 3 fail");
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&et, start, stop);
    printf("shared Elapsed time: %fms\n", et);
    printf("shared Kkeys/s: %d\n", (int)(DSIZE/et));
    return 0;
}
$ nvcc -O3 -arch=sm_20 -o t10 t10.cu
$ ./t10
global Elapsed time: 0.236960ms
global Kkeys/s: 4321
CUB Elapsed time: 0.042816ms
CUB Kkeys/s: 23916
shared Elapsed time: 0.040192ms
shared Kkeys/s: 25477
$

For this test, I am using CUDA 6.0RC, cub v1.2.0 (which is pretty recent), RHEL5.5/gcc4.1.2, and a Quadro5000 GPU (cc2.0, 11SMs, approximately 40% slower than a GTX480). Here are some observations that occur to me:

The speed ratio of the original cub sort(2) to the global memory sort(1) is approximately 6:1, which is approximately the bandwidth ratio of shared memory (~1TB/s) to global memory (~150GB/s).
The original cub sort(2) has a throughput, that when scaled for the number of SMs (11), yielding 263MKeys/s, is a sizeable fraction of the best device-wide sort I have seen on this device (thrust sort, yielding ~480MKeys/s)
The shared-memory only sort is not much faster than the original cub sort which copies input/output from/to global memory, indicating the copy from global memory to the cub temp storage is not a large fraction of the overall processing time.

The 6:1 penalty is a large one to pay. So my recommendation would be, if possible to use a device-wide sort on problem sizes larger than what can be handled easily by cub block sorting. This allows you to tap into the expertise of some of the best GPU code writers for your sorting, and achieve throughputs much closer to what the device as a whole is capable of.

Note that so I could test under similar conditions, the problem size here (512 threads, 2 elements per thread) does not exceed what you can do in a CUB block sort. But it's not difficult to extend the data set size to larger values (say, 1024 elements per thread) that can be only handled (in this context, among these choices) using the first approach. If I do larger problem sizes like that, on my GPU I observe a throughput of around 6Mkeys/s for the global memory block sort on my cc2.0 device.