I am trying to measure the execution time of a kernel on the GPU and compare it with the CPU. I wrote a simple_add function that adds two short int vectors element by element (two 16-bit values packed into each 32-bit word). The kernel code is:
kernel void simple_add(global const int * A, global const uint * B, global int* C)
{
    // Add the two 16-bit halves of each 32-bit word independently.
    int AA = A[get_global_id(0)];
    int BB = B[get_global_id(0)];
    int AH = 0xFFFF0000 & AA;         // high 16 bits of A
    int AL = 0x0000FFFF & AA;         // low 16 bits of A
    int BH = 0xFFFF0000 & BB;         // high 16 bits of B
    int BL = 0x0000FFFF & BB;         // low 16 bits of B
    int CL = (AL + BL) & 0x0000FFFF;  // low sum, carry discarded
    int CH = (AH + BH) & 0xFFFF0000;  // high sum
    C[get_global_id(0)] = CH | CL;
}
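For checking the kernel's output on the host, the same packed 16-bit addition can be sketched in plain C++. This is only an illustration under stated assumptions: the names add_packed16 and add_packed16_all are hypothetical, and this is not the AddImagesCPU function used in the timing code, whose source is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference: adds the two 16-bit lanes packed into one
// 32-bit word. Each lane wraps around on its own, so a carry out of the low
// lane never reaches the high lane.
inline std::uint32_t add_packed16(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t lo = (a + b) & 0x0000FFFFu;  // low lane, carry dropped
    const std::uint32_t hi =
        ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;  // high lane
    return hi | lo;
}

// Applies add_packed16 to every word; both inputs must have the same length.
inline std::vector<std::uint32_t> add_packed16_all(
    const std::vector<std::uint32_t>& a, const std::vector<std::uint32_t>& b)
{
    assert(a.size() == b.size());
    std::vector<std::uint32_t> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = add_packed16(a[i], b[i]);
    return c;
}
```

Comparing the buffer read back from the device against add_packed16_all on the same inputs is one way to confirm that both versions compute the same thing before comparing their speed.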
I wrote a CPU version of this function as well, ran each version 100 times, and measured the execution times:
clock_t before_GPU = clock();
for (int i = 0; i < 100; i++)
{
    queue.enqueueNDRangeKernel(kernel_add, 1,
        cl::NDRange((size_t)(NumberOfAllElements/4)), cl::NDRange(64));
    queue.finish();
}
clock_t after_GPU = clock();

clock_t before_CPU = clock();
for (int i = 0; i < 100; i++)
    AddImagesCPU(A, B, C);
clock_t after_CPU = clock();
After calling the whole measurement function 10 times, the results were as follows:
CPU time: 1359
GPU time: 1372
----------------
CPU time: 1336
GPU time: 1269
----------------
CPU time: 1436
GPU time: 1255
----------------
CPU time: 1304
GPU time: 1266
----------------
CPU time: 1305
GPU time: 1252
----------------
CPU time: 1313
GPU time: 1255
----------------
CPU time: 1313
GPU time: 1253
----------------
CPU time: 1384
GPU time: 1254
----------------
CPU time: 1300
GPU time: 1254
----------------
CPU time: 1322
GPU time: 1254
----------------
The problem is that I expected the GPU to be much faster than the CPU, but it was not. I can't understand why the GPU is not much faster than the CPU. Is there a problem in my code? Here are my GPU properties:
-----------------------------------------------------
------------- Selected Platform Properties-------------:
NAME: AMD Accelerated Parallel Processing
EXTENSION: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
VENDOR: Advanced Micro Devices, Inc.
VERSION: OpenCL 1.2 AMD-APP (937.2)
PROFILE: FULL_PROFILE
-----------------------------------------------------
------------- Selected Device Properties-------------:
NAME : ATI RV730
TYPE : 4
VENDOR : Advanced Micro Devices, Inc.
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.0 AMD-APP (937.2)
EXTENSIONS : cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
MAX_COMPUTE_UNITS : 8
MAX_WORK_GROUP_SIZE : 128
OPENCL_C_VERSION : OpenCL C 1.0
DRIVER_VERSION: CAL 1.4.1734
==========================================================
And for comparison, these are my CPU specifications:
------------- CPU Properties-------------:
NAME : Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
TYPE : 2
VENDOR : GenuineIntel
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.2 AMD-APP (937.2)
MAX_COMPUTE_UNITS : 4
MAX_WORK_GROUP_SIZE : 1024
OPENCL_C_VERSION : OpenCL C 1.2
DRIVER_VERSION: 2.0 (sse2,avx)
==========================================================
I also measured the wall-clock time using QueryPerformanceCounter, and here are the results:
CPU time: 1304449.6 micro-sec
GPU time: 1401740.82 micro-sec
----------------------
CPU time: 1620076.55 micro-sec
GPU time: 1310317.64 micro-sec
----------------------
CPU time: 1468520.44 micro-sec
GPU time: 1317153.63 micro-sec
----------------------
CPU time: 1304367.29 micro-sec
GPU time: 1251865.14 micro-sec
----------------------
CPU time: 1301589.17 micro-sec
GPU time: 1252889.4 micro-sec
----------------------
CPU time: 1294750.21 micro-sec
GPU time: 1257017.41 micro-sec
----------------------
CPU time: 1297506.93 micro-sec
GPU time: 1252768.9 micro-sec
----------------------
CPU time: 1293511.29 micro-sec
GPU time: 1252019.88 micro-sec
----------------------
CPU time: 1320753.54 micro-sec
GPU time: 1248895.73 micro-sec
----------------------
CPU time: 1296486.95 micro-sec
GPU time: 1255207.91 micro-sec
----------------------
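A portable way to take the same kind of wall-clock measurement is std::chrono::steady_clock. The sketch below is only an illustration; the helper name elapsed_microseconds is hypothetical. For the GPU case, the callable has to include queue.finish(), otherwise only the enqueue cost is measured.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in microseconds.
// steady_clock is monotonic, so the result is not affected by system clock
// adjustments made while the work is running.
template <typename Work>
double elapsed_microseconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

For example, elapsed_microseconds([&] { for (int i = 0; i < 100; i++) AddImagesCPU(A, B, C); }) would time the CPU loop from the code above.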
I then tried OpenCL event profiling to measure the execution time:
queue.enqueueNDRangeKernel(kernel_add, 1,
    cl::NDRange((size_t)(NumberOfAllElements/4)),
    cl::NDRange(64), NULL, &ev);
ev.wait();
queue.finish();
time_start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
time_end = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
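The two profiling timestamps are device times in nanoseconds, and they are only valid when the command queue was created with CL_QUEUE_PROFILING_ENABLE. Converting them to microseconds could be sketched as follows; the helper name profiling_interval_us is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of OpenCL profiling timestamps (nanoseconds) into the
// elapsed time in microseconds. start_ns and end_ns stand for the values of
// CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
inline double profiling_interval_us(std::uint64_t start_ns, std::uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return static_cast<double>(end_ns - start_ns) / 1000.0;
}
```

With the variables above this would be called as profiling_interval_us(time_start, time_end).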
The results for a single execution were more or less the same:
CPU time: 13335.1815 micro-sec
GPU time: 11865.111 micro-sec
----------------------
CPU time: 13884.0235 micro-sec
GPU time: 11663.889 micro-sec
----------------------
CPU time: 19724.7296 micro-sec
GPU time: 14548.222 micro-sec
----------------------
CPU time: 19945.3199 micro-sec
GPU time: 15331.111 micro-sec
----------------------
CPU time: 17973.5055 micro-sec
GPU time: 11641.444 micro-sec
----------------------
CPU time: 12652.6683 micro-sec
GPU time: 11632 micro-sec
----------------------
CPU time: 18875.292 micro-sec
GPU time: 14783.111 micro-sec
----------------------
CPU time: 32782.033 micro-sec
GPU time: 11650.444 micro-sec
----------------------
CPU time: 20462.2257 micro-sec
GPU time: 11647.778 micro-sec
----------------------
CPU time: 14529.6618 micro-sec
GPU time: 11860.112 micro-sec
The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with 1/4 of the total number of work items (that is, NumberOfAllElements/16). This would also let each work item load from memory faster.
Also, the kernel does very little computation compared to its memory operations. Mapping the buffers to host RAM would perform better: don't copy the arrays, map them with the map/unmap enqueue commands.
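A uint4 version of the kernel could look roughly like the sketch below. It is untested on the RV730 and rests on assumptions: the kernel name simple_add_vec4 and the helper vec4_global_size are hypothetical, the buffers are assumed to hold a multiple of four 32-bit words, and the resulting global size still has to be a multiple of the local size (64 in the question). The OpenCL C source is kept in a C++ string so it can be handed to the program builder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical uint4 variant: each work item adds four 32-bit words, that is,
// eight packed 16-bit values, keeping the two halves of every word separate.
static const std::string kSimpleAddVec4Source = R"CLC(
kernel void simple_add_vec4(global const uint4* A,
                            global const uint4* B,
                            global uint4* C)
{
    const uint4 lo_mask = (uint4)(0x0000FFFFu);
    const uint4 hi_mask = (uint4)(0xFFFF0000u);
    size_t i = get_global_id(0);
    uint4 a = A[i];
    uint4 b = B[i];
    uint4 lo = (a + b) & lo_mask;                          // low halves
    uint4 hi = ((a & hi_mask) + (b & hi_mask)) & hi_mask;  // high halves
    C[i] = hi | lo;
}
)CLC";

// Global work size for the uint4 kernel: a quarter of the
// NumberOfAllElements/4 work items used by the scalar kernel.
inline std::size_t vec4_global_size(std::size_t numberOfAllElements)
{
    return numberOfAllElements / 16;
}
```

The launch would then use cl::NDRange(vec4_global_size(NumberOfAllElements)) as the global range in place of NumberOfAllElements/4.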
If it's still not faster, you can use the GPU and the CPU at the same time, one working on the first half of the data and the other on the second half, to finish in about 50% of the time.
Also, don't call clFinish inside the loop; call it once, right after the loop ends. The kernels are then enqueued much faster, and because the queue executes in order (I assume it is an in-order queue), a kernel will not start before the previous one has finished. Calling clFinish after every enqueue only adds overhead; a single clFinish after the last kernel is enough.
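The pattern can be sketched generically as below. The names enqueue_then_finish_once and CountingQueue are hypothetical; CountingQueue is only a stand-in that counts calls so the sketch can run without an OpenCL device, and any type with a finish() member, such as cl::CommandQueue, could take its place.

```cpp
#include <cassert>

// Stand-in for a command queue: it only counts what was asked of it.
struct CountingQueue
{
    int enqueued = 0;
    int finished = 0;
    void finish() { ++finished; }
};

// Submits the same launch `iterations` times and synchronizes only once.
// `enqueue` is a callable that submits one kernel launch to the queue.
template <typename Queue, typename Enqueue>
void enqueue_then_finish_once(Queue& queue, Enqueue enqueue, int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i)
        enqueue(queue);  // no finish() here: an in-order queue serializes launches
    queue.finish();      // single synchronization point after the last launch
}

// Possible use with the code from the question (not compiled here):
//   enqueue_then_finish_once(queue, [&](cl::CommandQueue& q) {
//       q.enqueueNDRangeKernel(kernel_add, 1,
//           cl::NDRange((size_t)(NumberOfAllElements/4)), cl::NDRange(64));
//   }, 100);
```

Timing the whole call then measures 100 kernel executions plus one synchronization instead of 100 synchronizations.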
ATI RV730: 64 VLIW units, each with at least 4 streaming cores, at 750 MHz.
i3-2100: 2 cores (the extra hardware threads only fill pipeline bubbles), each with AVX capable of streaming 8 operations simultaneously, so 16 operations can be in flight, at more than 3 GHz.
Simply multiplying the number of streaming operations by the clock frequency in GHz:
ATI RV730 = 64 * 4 * 0.75 = 192 units (more with multiply-add, using the 5th element of each VLIW unit)
i3-2100 = 2 * 8 * 3 = 48 units
So the GPU should be at least 4x as fast (use int4, uint4). This holds for simple ALU and FPU operations such as bitwise operations and multiplications. The performance of special functions such as transcendentals could be different, since they run only on the 5th unit of each VLIW.
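The estimate above can be written out as a small calculation. This is only a back-of-the-envelope sketch using the figures quoted in this answer (with the CPU clock rounded down to 3 GHz), not a measurement; the names DeviceEstimate, peak_estimate, kRV730 and kI3_2100 are hypothetical.

```cpp
// Rough peak throughput: units * lanes per unit * clock in GHz.
struct DeviceEstimate
{
    int units;           // VLIW units on the GPU, cores on the CPU
    int lanes_per_unit;  // operations each unit can stream at once
    double clock_ghz;    // clock frequency in GHz
};

constexpr double peak_estimate(DeviceEstimate d)
{
    return d.units * d.lanes_per_unit * d.clock_ghz;
}

constexpr DeviceEstimate kRV730{64, 4, 0.75};  // 64 VLIW units, 4 lanes, 750 MHz
constexpr DeviceEstimate kI3_2100{2, 8, 3.0};  // 2 cores, 8-wide AVX, about 3 GHz

// 192 / 48 = 4, the "at least 4x" figure quoted above.
constexpr double kExpectedSpeedup =
    peak_estimate(kRV730) / peak_estimate(kI3_2100);
```

This ratio only bounds arithmetic throughput; a kernel limited by memory transfers, like the one in the question, will not reach it.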
I did some extra tests and realized that the GPU is optimized for floating-point operations. I changed the test code as below:
and got the result that I expected (about 10 times faster):
For slightly heavier floating-point operations like the ones below:
the result was more or less the same: