Using SSE instructions

2019-01-31 00:44发布

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?

15条回答
smile是对你的礼貌
2楼-- · 2019-01-31 01:28

We have implemented some image processing code, similar to what you describe but on a byte array, In SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even in respect to the Intel compiler. However, as you already mentioned you have the following drawbacks:

  • Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers and even to a 64 bit OS can also be a problem.

  • You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.

  • Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.

My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.

查看更多
不美不萌又怎样
3楼-- · 2019-01-31 01:31

SSE instructions were originally just on Intel chips, but recently (since Athlon?) AMD supports them as well, so if you do code against the SSE instruction set, you should be portable to most x86 procs.

That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)

查看更多
唯我独甜
4楼-- · 2019-01-31 01:32

I can tell from my experince that SSE brings a huge (4x and up) speedup over a plain c version of the code (no inline asm, no intrinsics used) but hand-optimized assembler can beat Compiler-generated assembly if the compiler can't figure out what the programmer intended (belive me, compilers don't cover all possible code combinations and they never will). Oh and, the compiler can't everytime layout the data that it runs at the fastest-possible speed. But you need much experince for a speedup over an Intel-compiler (if possible).

查看更多
手持菜刀,她持情操
5楼-- · 2019-01-31 01:34

This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.

My guess, from what you describe is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions, however, it might be worth just trying to use a branchless min/max caluculation instead. This might achieve most of the gains with less pain.

Something like this:

inline int 
minimum(int a, int b)
{
  int mask = (a - b) >> 31;
  return ((a & mask) | (b & ~mask));
}
查看更多
Emotional °昔
6楼-- · 2019-01-31 01:34

If you intend to use Microsoft Visual C++, you should read this:

http://www.codeproject.com/KB/recipes/sseintro.aspx

查看更多
成全新的幸福
7楼-- · 2019-01-31 01:38

SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.

Problems:

1 If the code path is dependant on the data being processed, SIMD becomes much harder to implement. For example:

a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
  a += 2;
  array [index] = a;
}
++index;

is not easy to do as SIMD:

a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask         a2 &= mask           a3 &= mask           a4 &= mask
a1 >>= shift       a2 >>= shift         a3 >>= shift         a4 >>= shift
if (a1<somevalue)  if (a2<somevalue)    if (a3<somevalue)    if (a4<somevalue)
  // help! can't conditionally perform this on each column, all columns must do the same thing
index += 4

2 If the data is not contigous then loading the data into the SIMD instructions is cumbersome

3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 cpus support SSE.

You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.

查看更多
登录 后发表回答