CUB (CUDA UnBound) equivalent of thrust::gather

Due to some performance issues with the Thrust libraries (see this page for more details), I am planning on re-factoring a CUDA application to use CUB instead of Thrust. Specifically, to replace the thrust::sort_by_key and thrust::inclusive_scan calls). In a particular point in my application I need to sort 3 arrays by key. This is how I did this with thrust:

thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather_wrapper(indices, indices + numKeys, 
      thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
      thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);

where

key iter is a thrust::device_ptr that points to the keys i want to sort by
indices point to a sequence (from 0 to numKeys-1) in device memory
values{1,2,3}Ptr are device_ptrs to the values i want to sort
values{1,2,3}OutPtr are device_ptrs to the sorted values

With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. Problem is I don't see any CUB "gather-like" utilities. Suggestions?

EDIT:

I suppose I could implement my own gather kernel, but is there any better way to do this other than:

template <typename Index, typename Value> 
__global__ void  gather_kernel(const unsigned int N, const Index * map, 
const Value * src, Value * dst) 
{ 
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; 
    if (i < N) 
    { 
        dst[i] = src[map[i]]; 
    } 
}

The non-coalesed loads and stores make me chringe, but it probably unavoidable without a known structure on map.

It seems what you want to achieve depends on thrust::zip_iterator. You could either

only replace thrust::sort_by_key by cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
zip values{1,2,3} into array of structures before using cub::DeviceRadixSort::SortPairs

update

After reading the implementation of thrust::gather,

$CUDA_HOME/include/thrust/system/detail/generic/gather.inl

you can see it is only a naive kernel like

__global__ gather(int* index, float* in, float* out, int len) {
  int i=...;
  if (i<len) { out[i] = in[index[i]]; }
}

Then I think your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wize primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers and then perform a naive re-order copy as thrust::gather to fill values{1,2,3}Out.

Using SortBlockedToStriped rather than Sort can do coalesced writing (not for reading though) when copying the values.