Currently I'm work with two gtx 650 . My program resembles in simple Clients/Server structure. I distribute the work threads on the two gpus. The Server thread need to gather the result vectors from client threads, so I need to copy the memory between the two gpu. Unfortunaly, the simple P2P program in cuda samples just doesn't work because my cards don't have TCC drivers. Spending two hours searching on google and SO, I can't find the answer.Some source says I should use cudaMemcpyPeer
, and some other source says I should use cudaMemcpy
with cudaMemcpyDefault
.Is there some simple way to get my work done other than copy to host then copy to device. I know it must have been documented somewhere, but I can't find it.Thank you for your help.
问题:
回答1:
Transferring data from one GPU to another will often require a "staging" through host memory. The exception to this is when the GPUs and the system topology support peer-to-peer (P2P) access and P2P has been explicitly enabled. In that case, data transfers can flow directly over the PCIE bus from one GPU to another.
In either case (with or without P2P being available/enabled) the typical cuda runtime API call would be cudaMemcpyPeer
/cudaMemcpyPeerAsync
as demonstrated in the cuda p2pBandwidthLatencyTest sample code.
On windows, one of the requirements of P2P is that both devices be supported by a driver in TCC mode. TCC mode is, for the most part, not an available option for GeForce GPUs (recently, an exception is made for GeForce Titan family GPUs using drivers and runtime available in the CUDA 7.5RC toolkit.)
Therefore, on Windows, these GPUs will not be able to take advantage of direct P2P transfers. Nevertheless, a nearly identical sequence can be used to transfer data. The CUDA runtime will detect the nature of the transfer, and perform an allocation "under the hood" to create a staging buffer. The transfer will then be completed in 2 parts: a transfer from the originating device to the staging buffer, and a transfer from the staging buffer to the destination device.
The following is a fully worked example showing how to transfer data from one GPU to another, while taking advantage of P2P access if it is available:
$ cat t850.cu
#include <stdio.h>
#include <math.h>
#define SRC_DEV 0
#define DST_DEV 1
#define DSIZE (8*1048576)
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
int main(int argc, char *argv[]){
int disablePeer = 0;
if (argc > 1) disablePeer = 1;
int devcount;
cudaGetDeviceCount(&devcount);
cudaCheckErrors("cuda failure");
int srcdev = SRC_DEV;
int dstdev = DST_DEV;
if (devcount <= max(srcdev,dstdev)) {printf("not enough cuda devices for the requested operation\n"); return 1;}
int *d_s, *d_d, *h;
int dsize = DSIZE*sizeof(int);
h = (int *)malloc(dsize);
if (h == NULL) {printf("malloc fail\n"); return 1;}
for (int i = 0; i < DSIZE; i++) h[i] = i;
int canAccessPeer = 0;
if (!disablePeer) cudaDeviceCanAccessPeer(&canAccessPeer, srcdev, dstdev);
cudaSetDevice(srcdev);
cudaMalloc(&d_s, dsize);
cudaMemcpy(d_s, h, dsize, cudaMemcpyHostToDevice);
if (canAccessPeer) cudaDeviceEnablePeerAccess(dstdev,0);
cudaSetDevice(dstdev);
cudaMalloc(&d_d, dsize);
cudaMemset(d_d, 0, dsize);
if (canAccessPeer) cudaDeviceEnablePeerAccess(srcdev,0);
cudaCheckErrors("cudaMalloc/cudaMemset fail");
if (canAccessPeer) printf("Timing P2P transfer");
else printf("Timing ordinary transfer");
printf(" of %d bytes\n", dsize);
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
cudaMemcpyPeer(d_d, dstdev, d_s, srcdev, dsize);
cudaCheckErrors("cudaMemcpyPeer fail");
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaSetDevice(dstdev);
cudaMemcpy(h, d_d, dsize, cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy fail");
for (int i = 0; i < DSIZE; i++) if (h[i] != i) {printf("transfer failure\n"); return 1;}
printf("transfer took %fms\n", et);
return 0;
}
$ nvcc -arch=sm_20 -o t850 t850.cu
$ ./t850
Timing P2P transfer of 33554432 bytes
transfer took 5.135680ms
$ ./t850 disable
Timing ordinary transfer of 33554432 bytes
transfer took 7.274336ms
$
Notes:
- Passing any command line parameter will disable the use of P2P even if it is available.
- The above results are for a system where P2P access is possible, and both GPUs are connected via a PCIE Gen2 link, capable of about 6GB/s transfer bandwidth in a single direction. The P2P transfer time is consistent with this (32MB/5ms ~= 6GB/s). The non-P2P transfer time is longer, but not double. This is due to the fact that for transfers to/from the staging buffer, after some data is transferred into the staging buffer, the outgoing transfer can begin. The driver/runtime takes advantage of this to partially overlap the data transfers.