In my project, I have implemented a custom memory allocator to avoid unneccessary calls to cudaMalloc
once the application has "warmed up". Moreover, I use custom kernels for basic array filling, arithmetic operations between arrays, etc. and would like to simplify my code by using Thrust
and getting rid of these kernels. Every array on the device is created and accessed through raw pointers (for now) and I'd like to use device_vector
and Thrust
s methods on these objects, but I find myself converting between raw pointers and device_ptr<>
all the time, somewhat cluttering up my code.
My rather vague question: How would/do you organize the usage of custom memory management, Thrust
s array methods and calls to custom kernels in the most readable way?
Like all standard c++ containers, you can customize how thrust::device_vector
allocates storage by providing it with your own "allocator". By default, thrust::device_vector
's allocator is thrust::device_malloc_allocator
, which allocates (deallocates) storage with cudaMalloc
(cudaFree
) when Thrust's backend system is CUDA.
Occasionally, it is desirable to customize the way device_vector
allocates memory, such as in the OP's case, who would like to sub-allocate storage within a single large allocation performed at program initialization. This can avoid overhead which may be incurred by many individual calls to the underlying allocation scheme, in this case, cudaMalloc
.
A simple way to provide device_vector
a custom allocator is to inherit from device_malloc_allocator
. One could in principle author an entire allocator from scratch, but with an inheritance approach, only the allocate
and deallocate
member functions need to be provided. Once the custom allocator is defined, it can be provided to device_vector
as its second template parameter.
This example code demonstrates how to provide a custom allocator which prints a message upon allocation and deallocation:
#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>
#include <iostream>
template<typename T>
struct my_allocator : thrust::device_malloc_allocator<T>
{
// shorthand for the name of the base class
typedef thrust::device_malloc_allocator<T> super_t;
// get access to some of the base class's typedefs
// note that because we inherited from device_malloc_allocator,
// pointer is actually thrust::device_ptr<T>
typedef typename super_t::pointer pointer;
typedef typename super_t::size_type size_type;
// customize allocate
pointer allocate(size_type n)
{
std::cout << "my_allocator::allocate(): Hello, world!" << std::endl;
// defer to the base class to allocate storage for n elements of type T
// in practice, you'd do something more interesting here
return super_t::allocate(n);
}
// customize deallocate
void deallocate(pointer p, size_type n)
{
std::cout << "my_allocator::deallocate(): Hello, world!" << std::endl;
// defer to the base class to deallocate n elements of type T at address p
// in practice, you'd do something more interesting here
super_t::deallocate(p,n);
}
};
int main()
{
// create a device_vector which uses my_allocator
thrust::device_vector<int, my_allocator<int> > vec;
// create 10 ints
vec.resize(10, 13);
return 0;
}
Here's the output:
$ nvcc my_allocator_test.cu -arch=sm_20 -run
my_allocator::allocate(): Hello, world!
my_allocator::deallocate(): Hello, world!
In this example, note that we hear from my_allocator::allocate()
once upon vec.resize(10,13)
. my_allocator::deallocate()
is invoked once when vec
goes out of scope as it destroys its elements.