I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of each integer and then find the min and max values. I have heard that if I use SSE instructions for these operations, it will run much faster than a normal loop written with bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor-specific?
We have implemented some image processing code, similar to what you describe but on a byte array, in SSE. The speedup compared to C code is considerable: depending on the exact algorithm, more than a factor of 4, even with respect to the Intel compiler. However, as you already mentioned, you have the following drawbacks:
- Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers, or even moving to a 64-bit OS, can also be a problem.
- Learning curve. It is steep, but I found that once you grasp the principles, writing new algorithms is not that hard.
- Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice would be to go for it only if you really need the performance improvement, you can't find a function for your problem in a library like Intel IPP, and you can live with the portability issues.
SSE instructions were originally just on Intel chips, but more recent AMD chips (since the Athlon?) support them as well, so if you code against the SSE instruction set, you should be portable to most x86 processors.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86 - an easier option might be to check your compiler docs and see if there are options that let the compiler generate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
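To give a concrete (hypothetical) starting point: a plain scalar loop like the one below, written with std::min/std::max rather than if/else, is the kind of code current compilers can often vectorize on their own when optimization and an appropriate architecture flag are enabled. The function and variable names here are made up for the example.

    #include <algorithm>
    #include <climits>
    #include <cstddef>

    // Example compile commands (check your compiler's docs for the exact flags):
    //   GCC:  g++ -O3 -march=native -fopt-info-vec minmax.cpp
    //   MSVC: cl /O2 /arch:SSE2 minmax.cpp
    void min_max_masked(const int* data, std::size_t n, int mask, int& lo, int& hi)
    {
        lo = INT_MAX;
        hi = INT_MIN;
        for (std::size_t i = 0; i < n; ++i)
        {
            int v = data[i] & mask;   // mask off the unwanted bits
            lo = std::min(lo, v);     // std::min/max map well onto SIMD min/max
            hi = std::max(hi, v);
        }
    }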
I can tell from my experience that SSE brings a huge (4x and up) speedup over a plain C version of the code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly if the compiler can't figure out what the programmer intended (believe me, compilers don't cover all possible code combinations and they never will). Also, the compiler can't always lay out the data so that it runs at the fastest possible speed. But you need a lot of experience to get a speedup over an Intel compiler (if that is possible at all).
This kind of problem is a perfect example of where a good low-level profiler (something like VTune) is essential. It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe, is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Using SIMD intrinsics would let you use the min/max instructions; however, it might be worth just trying a branchless min/max calculation first. That might achieve most of the gains with less pain.
Something like this:
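(The snippet below is only a sketch of one common branchless formulation, not the original code. It assumes 32-bit two's-complement ints; the arithmetic right shift of a negative value is implementation-defined before C++20 but behaves as expected on mainstream compilers.)

    #include <cstdint>

    // Branchless min/max for 32-bit signed ints (sketch).
    // diff >> 31 yields all 1s when a < b, all 0s otherwise.
    // Caveat: a - b can overflow for extreme inputs; fine for typical data ranges.
    inline int32_t branchless_min(int32_t a, int32_t b)
    {
        int32_t diff = a - b;
        int32_t sign = diff >> 31;    // -1 if a < b, 0 otherwise
        return b + (diff & sign);     // a if a < b, else b
    }

    inline int32_t branchless_max(int32_t a, int32_t b)
    {
        int32_t diff = a - b;
        int32_t sign = diff >> 31;
        return a - (diff & sign);     // b if a < b, else a
    }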
If you intend to use Microsoft Visual C++, you should read this:
http://www.codeproject.com/KB/recipes/sseintro.aspx
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage from using SSE as a straight replacement for the integer operations; you will only get an advantage if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing, and then stepping to the next set of values in the array.
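To make that concrete, here is a rough sketch (the names are mine, not from the question) of processing four masked 32-bit ints per iteration with intrinsics. Note that _mm_min_epi32/_mm_max_epi32 are SSE4.1 instructions; on SSE2-only CPUs you would have to emulate them with _mm_cmpgt_epi32 and bitwise blending.

    #include <smmintrin.h>   // SSE4.1 intrinsics
    #include <algorithm>
    #include <climits>
    #include <cstddef>

    void min_max_masked_sse(const int* data, std::size_t n, int mask, int& lo, int& hi)
    {
        __m128i vmask = _mm_set1_epi32(mask);
        __m128i vlo   = _mm_set1_epi32(INT_MAX);
        __m128i vhi   = _mm_set1_epi32(INT_MIN);

        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
        {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
            v   = _mm_and_si128(v, vmask);   // mask the bits of four ints at once
            vlo = _mm_min_epi32(vlo, v);     // running per-lane minimum
            vhi = _mm_max_epi32(vhi, v);     // running per-lane maximum
        }

        // Reduce the four lanes, then handle any leftover elements.
        int tmp_lo[4], tmp_hi[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp_lo), vlo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp_hi), vhi);
        lo = INT_MAX;
        hi = INT_MIN;
        for (int k = 0; k < 4; ++k)
        {
            lo = std::min(lo, tmp_lo[k]);
            hi = std::max(hi, tmp_hi[k]);
        }
        for (; i < n; ++i)
        {
            int v = data[i] & mask;
            lo = std::min(lo, v);
            hi = std::max(hi, v);
        }
    }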
Problems:
1 If the code path is dependent on the data being processed, SIMD becomes much harder to implement. For example, a loop in which the operation applied to each element depends on that element's value (see the sketch after this list) is not easy to do as SIMD.
2 If the data is not contiguous, then loading the data into the SIMD instructions is cumbersome.
3 The code is processor-specific. SSE is only on IA32 (Intel/AMD), and not all IA32 CPUs support SSE.
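As a hypothetical illustration of point 1 (the threshold, masks, and offsets are invented for the example), a per-element branch like the following forces a SIMD version to compute both branches for every lane and then blend the results with a compare mask:

    #include <cstddef>

    // Hypothetical data-dependent code path: which mask/offset is applied
    // depends on the value of each element, which is awkward to express in SIMD.
    void transform(int* data, std::size_t n, int threshold,
                   int mask_a, int offset_a, int mask_b, int offset_b)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if (data[i] < threshold)
                data[i] = (data[i] & mask_a) + offset_a;
            else
                data[i] = (data[i] & mask_b) + offset_b;
        }
    }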
You need to analyse the algorithm and the data to see if it can be SSE'd, and that requires knowing how SSE works. There's plenty of documentation on Intel's website.