I have a frequently called function that is well suited to parallel processing, so I investigated C++ AMP for starters. The function accepts three inputs:
- a vector of floats, which is the input data
- a vector of constant coefficients, which stays the same across calls
- an output vector, where the result is written to.
Now obviously, #1 has to be copied onto the GPU on each call. For this I'm using a stack-managed const array<>, which works fine.
For #2, the optimal case would be to somehow keep the vector in GPU memory, since it is constant. Is this possible using AMP? Or do I have to copy it every time I call parallel_for_each, as with #1?
For #3, is it possible to allocate the buffer on the GPU and copy it back afterwards, instead of creating an empty buffer on the CPU, copying it to the GPU, and copying it back once the results are written to it?
One last thing: since the parallel_for_each call is asynchronous in nature - and will be synchronized by either the destructor of #3 or array_view::synchronize() - is it possible to leave the current function (and stack space), do some other work while the GPU is processing, and then 'sync' up at a later point?
It would require a dynamically allocated array_view to avoid the synchronize() on destruction, but the function won't compile when I use pointers instead of stack-managed objects:
error C3581: unsupported type in amp restricted code
pointer or reference is not allowed as pointed to type, array element type or data member type (except reference to concurrency::array/texture)
Also, for those who are experienced with other frameworks like OpenCL: would I have better luck there?
1 - Yes. If you pass a const array_view as the input, it will not be copied back into host memory.
std::vector<float> cpu_data(20000000, 0.0f);
array_view<const float, 1> cpu_data_view(static_cast<int>(cpu_data.size()), cpu_data);
2 - Depending on how large your coefficient array is, you could do one of several things:
a - Store it in a local array within your parallel_for_each lambda. This is convenient but will use up (precious) local memory, so it is only realistic if the array is very small.
b - Create an array and explicitly copy your constant data into it before executing any AMP code.
array<float, 1> gpu_data(400);
std::vector<float> cpu_data(gpu_data.extent.size(), 1.0f);
copy(cpu_data.begin(), gpu_data);
In this case gpu_data will be available to all AMP code provided the lambda captures it.
c - Consider loading it into tile_static memory if it is accessed many times by each thread.
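For (c), the staging pattern might look roughly like the sketch below. Everything here is illustrative rather than drop-in code: the function name, kernel body, and TILE_SIZE are assumptions, it presumes the coefficient count equals the tile size and the input length is an exact multiple of it, and it only builds with MSVC's amp.h.

```cpp
#include <amp.h>
using namespace concurrency;

// Illustrative only: assumes the coefficient count equals TILE_SIZE and that
// the input length is an exact multiple of TILE_SIZE.
static const int TILE_SIZE = 256;

void apply_coefficients(const array<float, 1>& coeffs,     // already resident on the GPU
                        array_view<const float, 1> input,
                        array_view<float, 1> output)
{
    parallel_for_each(output.extent.tile<TILE_SIZE>(),
        [=, &coeffs](tiled_index<TILE_SIZE> t_idx) restrict(amp)
    {
        // Each thread stages one coefficient into fast tile_static memory...
        tile_static float local_coeffs[TILE_SIZE];
        local_coeffs[t_idx.local[0]] = coeffs[t_idx.local];
        t_idx.barrier.wait();   // ...and waits until the whole tile has done so.

        // An arbitrary kernel in which every thread reads every coefficient -
        // the access pattern that makes the staging worthwhile.
        float sum = 0.0f;
        for (int i = 0; i < TILE_SIZE; ++i)
            sum += local_coeffs[i] * input[t_idx.global];
        output[t_idx.global] = sum;
    });
}
```

After the barrier, the tile's threads read the coefficients from tile_static memory instead of hitting global memory TILE_SIZE times each.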
3 - You can still use an array_view to hold your output data, but calling discard_data on it prior to executing the parallel_for_each will prevent the needless copy to GPU memory.
std::vector<float> cpu_output_data(20000000, 0.0f);
array_view<float, 1> output_data_view(static_cast<int>(cpu_output_data.size()), cpu_output_data);
output_data_view.discard_data();
Async - Yes, it is completely possible to do this. You can combine AMP with C++ futures and async operations to execute other work on the CPU (or another GPU) concurrently. Remember that the CPU is involved in scheduling work on the GPU and in moving data to and from it, so if you overload the CPU then GPU performance may suffer.
As for your compiler error, it's hard to tell what the issue is without seeing the code, but it's perfectly fine to do the following:
std::unique_ptr<concurrency::array_view<int, 2>> data_view;
You might want to have a look at the examples covered in the C++ AMP book. They are available on CodePlex and cover a lot of these scenarios.