Data sharing between CPU and GPU on modern x86 hardware

Posted 2020-07-14 05:31

Question:

The ongoing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th-generation CPUs, should allow copy-free sharing of data between the CPU and GPU. I would like to know whether the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structures between code running on the CPU and the GPU.

Answer 1:

The ability to share data between host and device without any memory transfers has been available in OpenCL since version 1.0, via the CL_MEM_ALLOC_HOST_PTR flag. This flag allocates a buffer for the device, but ensures that it lies in memory that is also accessible by the host. The workflow for these 'zero-copy' transfers usually takes this form:

// Allocate a device buffer backed by host-accessible memory
cl_int err;
cl_mem d_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

// Map the buffer to obtain a host pointer (blocking map, so the
// pointer is safe to use as soon as the call returns)
void *h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);

// Write data into h_buffer from the host
... 

// Unmap the memory buffer
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);

// Do stuff with the buffer on the device
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buffer);
clEnqueueNDRangeKernel(queue, kernel, ...);

This will create a device buffer, write some data into it from the host, and then run a kernel using this buffer on the device. Because of the way that the buffer was allocated, this should not result in a memory transfer if the device and host have a unified memory system.
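To get results back on the host after the kernel has finished, the same map/unmap pattern applies in the other direction. A minimal sketch, continuing from the snippet above (error handling omitted):

// Map the buffer again, this time for reading; the blocking map waits
// for the kernel (on an in-order queue) before returning
h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_READ,
                              0, size, 0, NULL, NULL, &err);

// Read results from h_buffer on the host
...

// Unmap when done
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);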


The above approach is limited to simple, flat data structures (1D arrays). If you are interested in working with something a little more complex, such as linked lists, trees, or other pointer-based data structures, you'll need to take advantage of the Shared Virtual Memory (SVM) feature in OpenCL 2.0. At the time of writing, AMD and Intel have both released some preview support for OpenCL 2.0 functionality, but I cannot vouch for their implementations of SVM.

The workflow for the SVM approach will be somewhat similar to the code listed above. In short, you will allocate a buffer using clSVMAlloc, which will return a pointer that is valid on both the host and device. You will use clEnqueueSVMMap and clEnqueueSVMUnmap to synchronise the data when you wish to access the buffer from the host, and clSetKernelArgSVMPointer to pass it to the device. The crucial difference between SVM and CL_MEM_ALLOC_HOST_PTR is that an SVM pointer can also be included inside another buffer passed to the device (e.g. inside a struct or pointed to by another pointer). This is what allows you to build complex pointer-based data structures that can be shared between the host and device.
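As a rough illustration, here is a sketch of a tiny linked list built inside a single coarse-grained SVM allocation (the Node type and two-node layout are hypothetical, and error handling is omitted). Because both nodes live in one allocation, passing the base pointer to the kernel is sufficient:

// A node type shared between host and device; 'next' is an SVM
// pointer that is valid on both sides (hypothetical example type)
typedef struct Node {
    int value;
    struct Node *next;
} Node;

// Allocate space for two nodes in one coarse-grained SVM allocation
Node *nodes = (Node *)clSVMAlloc(context, CL_MEM_READ_WRITE, 2 * sizeof(Node), 0);

// Map before touching the memory on the host (required for coarse-grained SVM)
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, nodes, 2 * sizeof(Node), 0, NULL, NULL);

nodes[0].value = 1;
nodes[0].next  = &nodes[1];   // a pointer stored inside the shared buffer itself
nodes[1].value = 2;
nodes[1].next  = NULL;

clEnqueueSVMUnmap(queue, nodes, 0, NULL, NULL);

// Pass the head of the list to the kernel; the device can follow 'next'
clSetKernelArgSVMPointer(kernel, 0, nodes);
clEnqueueNDRangeKernel(queue, kernel, ...);

// Eventually release the allocation
clSVMFree(context, nodes);

If the list spanned multiple SVM allocations, each additional allocation would also have to be declared to the runtime (via clSetKernelExecInfo with CL_KERNEL_EXEC_INFO_SVM_PTRS) so that the kernel is allowed to dereference it.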



Tags: opencl gpgpu