Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?

标签： c++ optimization assembly processor sse

15条回答

smile是对你的礼貌

2楼-- · 2019-01-31 01:28

We have implemented some image processing code, similar to what you describe but on a byte array, In SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even in respect to the Intel compiler. However, as you already mentioned you have the following drawbacks:

Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers and even to a 64 bit OS can also be a problem.
You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.
Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.

My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.

0人赞添加讨论(0) 举报

不美不萌又怎样

3楼-- · 2019-01-31 01:31

SSE instructions were originally just on Intel chips, but recently (since Athlon?) AMD supports them as well, so if you do code against the SSE instruction set, you should be portable to most x86 procs.

That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)

0人赞添加讨论(0) 举报

唯我独甜

4楼-- · 2019-01-31 01:32

I can tell from my experince that SSE brings a huge (4x and up) speedup over a plain c version of the code (no inline asm, no intrinsics used) but hand-optimized assembler can beat Compiler-generated assembly if the compiler can't figure out what the programmer intended (belive me, compilers don't cover all possible code combinations and they never will). Oh and, the compiler can't everytime layout the data that it runs at the fastest-possible speed. But you need much experince for a speedup over an Intel-compiler (if possible).

0人赞添加讨论(0) 举报

手持菜刀，她持情操

5楼-- · 2019-01-31 01:34

This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.

My guess, from what you describe is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions, however, it might be worth just trying to use a branchless min/max caluculation instead. This might achieve most of the gains with less pain.

Something like this:

inline int 
minimum(int a, int b)
{
  int mask = (a - b) >> 31;
  return ((a & mask) | (b & ~mask));
}

0人赞添加讨论(0) 举报

Emotional °昔

6楼-- · 2019-01-31 01:34

If you intend to use Microsoft Visual C++, you should read this:

http://www.codeproject.com/KB/recipes/sseintro.aspx

0人赞添加讨论(0) 举报

成全新的幸福

7楼-- · 2019-01-31 01:38

SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.

Problems:

1 If the code path is dependant on the data being processed, SIMD becomes much harder to implement. For example:

a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
  a += 2;
  array [index] = a;
}
++index;

is not easy to do as SIMD:

a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask         a2 &= mask           a3 &= mask           a4 &= mask
a1 >>= shift       a2 >>= shift         a3 >>= shift         a4 >>= shift
if (a1<somevalue)  if (a2<somevalue)    if (a3<somevalue)    if (a4<somevalue)
  // help! can't conditionally perform this on each column, all columns must do the same thing
index += 4

2 If the data is not contigous then loading the data into the SIMD instructions is cumbersome

3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 cpus support SSE.

You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.

0人赞添加讨论(0) 举报

1 2 3 下一页

Using SSE instructions

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间