Does GPGPU programming only allow the execution of SIMD instructions? If so, doesn't that make it a tedious task to rewrite an algorithm that was designed for a general-purpose CPU so that it runs on a GPU? Also, is there a pattern in algorithms that can be converted to a SIMD architecture?
Answer 1:
Well, it's not quite accurate that GPGPU only supports SIMD execution. Many GPUs have some non-SIMD components. But, overall, to take full advantage of a GPU you need to be running SIMD code.
However, you are NOT necessarily writing SIMD instructions. That is, GPU SIMD is not the same as CPU SIMD - not the same as writing code to take advantage of x86 SSE (Streaming SIMD Extensions), etc. Indeed, as one of the people who brought CPU SIMD to you (I was heavily involved in Intel MMX, one of the earliest such instruction sets, and have followed the evolution to FP SIMD), I often feel obliged to correct people who say that CPUs like Intel's have SIMD instructions. I prefer to consider them packed vector instructions, although I grudgingly call them SIMD packed vector instruction sets just because everyone misuses the name. I also emphasize that CPU SIMD instruction sets such as MMX and SSE may have SIMD packed vector execution units - integer and floating point ALUs, etc. - but they don't have SIMD control flow, and they usually don't have SIMD memory access, a.k.a. scatter/gather (although Intel Larrabee was moving in that direction).
Some pages on my comp-arch.net wiki about this (I write about computer architecture as a hobby):
- http://semipublic.comp-arch.net/wiki/SIMD
- http://semipublic.comp-arch.net/wiki/SIMD_packed_vector
- http://semipublic.comp-arch.net/wiki/Difference_between_vector_and_packed_vector
- http://semipublic.comp-arch.net/wiki/Single_Instruction_Multiple_Threads_(SIMT)
although I apologize for not yet having written the page that talks about SIMD packed vector instruction sets, as in Intel MMX or SSE.
But I don't expect you to read all of the above. Let me try to explain.
Imagine that you have a piece of code that looks something like this, when written in a simple, scalar manner:
// operating on an array with one million 32b floating point elements A[1000000]
for i from 0 upto 999999 do
    if some_condition(A[i]) then
        A[i] = function1(A[i])
    else
        A[i] = function2(A[i])
where function1() and function2() are simple enough to inline - say function1(x) = x*x and function2(x) = sqrt(x).
On a CPU, to use something like SSE or AVX, you would have to (1) divide the array up into chunks, say the width of a 256-bit AVX register, and (2) handle the IF statement yourself, using masks or the like. Something like:
for i from 0 upto 999999 by 8 do
    register tmp256b_1 = load256b(&A[i])
    register tmp256b_2 = tmp256b_1 * tmp256b_1   // function1
    register tmp256b_3 = _mm_sqrt_ps(tmp256b_1)  // function2, via an "intrinsic" -
                                                 // a function, possibly inlined,
                                                 // perhaps doing a Newton-Raphson iteration to evaluate sqrt
    register mask256b = ... code that arranges for you to have 32 1s in each 32-bit "lane"
                            where some_condition is true, and 0s elsewhere ...
    register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
    store256b(&A[i], tmp256b_4)
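To make that concrete, here is roughly the same thing written with real AVX intrinsics - just a sketch, assuming some_condition(x) is simply x > 1.0f and that the array length is a multiple of 8 (neither of which is part of the original example):

#include <stddef.h>
#include <immintrin.h>

void process(float *A, size_t n)                      /* n assumed to be a multiple of 8 */
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 sq   = _mm256_mul_ps(x, x);            /* function1: x*x     */
        __m256 rt   = _mm256_sqrt_ps(x);              /* function2: sqrt(x) */
        __m256 mask = _mm256_cmp_ps(x, _mm256_set1_ps(1.0f), _CMP_GT_OQ);  /* hypothetical some_condition */
        __m256 res  = _mm256_blendv_ps(rt, sq, mask); /* per lane: mask ? sq : rt */
        _mm256_storeu_ps(&A[i], res);
    }
}

Note that both function1 and function2 are computed for every element, and the mask just selects which result survives - exactly the "compute both sides of the IF" pattern sketched above.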
You may not think this is so bad, but remember, this is a simple example. Imagine multiple nested IFs, and so on. Or, imagine that "some_condition" is clumpy, so that you might save a lot of unnecessary computation by skipping sections where it is all function1 or all function2...
for i from 0 upto 999999 by 8 do
    register mask256b = ... code that arranges for you to have 32 1s in each 32-bit "lane"
                            where some_condition is true, and 0s elsewhere ...
    register tmp256b_1 = load256b(&A[i])
    if mask256b == ~0 then       // every lane takes function1
        register tmp256b_2 = tmp256b_1 * tmp256b_1
        store256b(&A[i], tmp256b_2)
    else if mask256b == 0 then   // every lane takes function2
        register tmp256b_3 = _mm_sqrt_ps(tmp256b_1)
        store256b(&A[i], tmp256b_3)
    else                         // mixed: compute both, then merge under the mask
        register tmp256b_2 = tmp256b_1 * tmp256b_1
        register tmp256b_3 = _mm_sqrt_ps(tmp256b_1)
        register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
        store256b(&A[i], tmp256b_4)
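Again for concreteness, the usual way to test "is the whole chunk on one side of the branch" with real AVX intrinsics is _mm256_movemask_ps, which packs the sign bit of each lane into an ordinary integer. A sketch, under the same assumptions as above:

#include <stddef.h>
#include <immintrin.h>

void process_clumpy(float *A, size_t n)               /* n assumed to be a multiple of 8 */
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 x    = _mm256_loadu_ps(&A[i]);
        __m256 mask = _mm256_cmp_ps(x, _mm256_set1_ps(1.0f), _CMP_GT_OQ);  /* hypothetical some_condition */
        int bits = _mm256_movemask_ps(mask);          /* one bit per 32-bit lane */
        if (bits == 0xFF) {                           /* all lanes take function1 */
            _mm256_storeu_ps(&A[i], _mm256_mul_ps(x, x));
        } else if (bits == 0) {                       /* all lanes take function2 */
            _mm256_storeu_ps(&A[i], _mm256_sqrt_ps(x));
        } else {                                      /* mixed: compute both, blend */
            __m256 sq = _mm256_mul_ps(x, x);
            __m256 rt = _mm256_sqrt_ps(x);
            _mm256_storeu_ps(&A[i], _mm256_blendv_ps(rt, sq, mask));
        }
    }
}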
I think you get the picture. And it gets even more complicated when you have multiple arrays, and sometimes the data is aligned on a 256-bit boundary and sometimes not (as is typical, say, in stencil computations, where you operate on all alignments).
Now, here's roughly what it looks like on something like a GPU:
// operating on an array with one million 32b floating point elements A[1000000]
for all i from 0 upto 999999 do
    if some_condition(A) then
        A = function1(A)
    else
        A = function2(A)
Doesn't that look a lot more like the original scalar code? The only real difference is that you have lost the array indexes, A[i]. (Actually, some GPGPU languages keep the array indexes in, but most that I know of do not.)
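For what it's worth, here is a rough sketch of what that might look like as an actual OpenCL C kernel, with the same hypothetical some_condition (x > 1.0f) as in the CPU sketches. OpenCL is one of the languages where the index does come back, via get_global_id, but the control flow stays scalar-looking; the compiler and the hardware take care of packing work-items into SIMD lanes and of masking the if/else:

__kernel void process(__global float *A)
{
    size_t i = get_global_id(0);   /* which element this work-item owns */
    float x = A[i];
    if (x > 1.0f)                  /* some_condition(A[i]) */
        A[i] = x * x;              /* function1 */
    else
        A[i] = sqrt(x);            /* function2 */
}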
Now, beyond OpenCL's C-like syntax, I have left out all of the setup that you need to connect the OpenCL code to your C or C++ code. (There are arguably much better languages than CUDA or OpenCL - these have a lot of cruft - but they are available in many places, on both CPUs and GPUs[**].) But I think I have presented the heart of the matter:
The key thing about GPGPU computation is that you write SIMD, data-parallel code. But you write it at a higher level than you write CPU-style SSE code - higher level even than the compiler intrinsics.
First, the GPGPU compiler, e.g. the OpenCL or CUDA compiler, handles a lot of the data management behind your back. The compiler arranges the control flow, the IF statements, etc.
By the way, note, as I marked with a [**], that sometimes a so-called SIMD GPGPU compiler can generate code that will run on both CPUs and GPUs. I.e. a SIMD compiler can generate code that uses CPU SIMD instruction sets.
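As for the host-side setup I mentioned leaving out: it really is cruft. A stripped-down sketch of what it takes just to build and launch the kernel above with the OpenCL 1.x C API (error checking, object releases, and a realistic some_condition all omitted; the names are my own) looks roughly like this:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* The kernel source from the sketch above, embedded as a string. */
static const char *src =
    "__kernel void process(__global float *A) {   \n"
    "    size_t i = get_global_id(0);             \n"
    "    float x = A[i];                          \n"
    "    A[i] = (x > 1.0f) ? x * x : sqrt(x);     \n"
    "}                                            \n";

int main(void)
{
    enum { N = 1000000 };
    float *A = malloc(N * sizeof(float));
    for (size_t i = 0; i < N; i++) A[i] = 2.0f * (float)i / N;   /* some test data */

    cl_int err;
    cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
    cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context       ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel  k    = clCreateKernel(prog, "process", &err);

    /* Copy the array to the device, run one work-item per element, copy back. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                N * sizeof(float), A, &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, N * sizeof(float), A, 0, NULL, NULL);

    printf("A[0]=%f  A[N-1]=%f\n", A[0], A[N - 1]);
    free(A);
    return 0;
}

Compare that with the handful of interesting lines in the kernel itself, and you can see why people complain about the boilerplate.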
But GPUs themselves have special hardware support that runs this SIMD code, appropriately compiled, much faster than it can run on the CPU using CPU SIMD instructions. Most importantly, the GPUs have many more execution units - e.g. a CPU core like AMD Bulldozer's has 2 sets of 128-bit-wide FMACs, i.e. is capable of doing 8 single-precision FMACs per cycle. Times the number of such cores on a chip - say 8 - that gives you maybe 64 per cycle. Whereas a modern GPU may do 2,048 32b FMACs every cycle. Even if running at 1/2 or 1/4 the clock rate, that's a big difference.
How can the GPUs have so much more hardware? Well, first, they are usually bigger chips than the CPU. But, also, they tend not to spend (some say "waste") hardware on things like big caches and out-of-order execution the way CPUs do. CPUs try to make one or a few computations fast, whereas GPUs do many computations in parallel, each individually slower than on the CPU. Still, the total number of computations that the GPU can do per second is much higher than what a CPU can do.
The GPUs have other hardware optimizations. For example, they run many more threads than a CPU. Whereas an Intel CPU has 2 hyperthreads per core, giving you 16 threads on an 8-core chip, a GPU may have hundreds. And so on.
Most interesting to me as a computer architect, many GPUs have special hardware support for SIMD control flow. They make manipulating those masks much more efficient than on a CPU running SSE.
And so on.
Anyway, I hope that I have made my point: while you do have to write SIMD code to run on a GPGPU system (using something like OpenCL), you should not confuse that sort of SIMD with the SIMD code you have to write to take advantage of Intel SSE. It's much cleaner.
More and more compilers are allowing the same code to run on both CPU and GPU. I.e. they are increasingly supporting the clean "real SIMD" coding style, rather than the fake "pseudo-SIMD" coding style that has been necessary to take advantage of MMX and SSE and AVX up until now. This is good - such code is equally "nice" to program on both CPU and GPU. But the GPU often runs it much faster. There's a paper by Intel called "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf. It says GPUs are "only" 2.5X faster on average. But that's after a lot of aggressive optimization of the CPU code. And I don't know about you, but I think "only" 2.5X faster is nothing to sneeze at - especially since the GPGPU code is often easier to write and to read.
Now, there's no free lunch. If your code is naturally data parallel, great. But some code is not. That can be a pain.
And, like all machines, GPUs have their quirks.
But if your code is naturally data parallel, you may get great speedups, with code that is much more readable.
I'm a CPU designer. I expect to borrow lots of ideas from GPUs to make CPUs run faster, and vice versa.