I have a short to float cast in C++ that is bottlenecking my code.
The code translates from a hardware device buffer, which is natively shorts. The buffer holds the input from a fancy photon counter.
float factor = 1.0f / value;
for (int i = 0; i < W*H; i++) // 25% of time is spent doing this
{
    int value = source[i]; // ushort -> int
    destination[i] = value * factor; // int * float -> float
}
A few details
Value should go from 0 to 2^16-1. It represents the pixel values of a highly sensitive camera.
I'm on a multicore x86 machine with an i7 processor (i7 960 which is SSE 4.2 and 4.1).
Source is aligned to an 8 bit boundary (a requirement of the hardware device)
W*H is always divisible by 8, most of the time W and H are divisible by 8
This makes me sad. Is there anything I can do about it?
I am using Visual Studio 2012.
Not sure whether the condition expression in the loop is evaluated only once. You can try computing W*H once, before the loop, and comparing against that value.
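The answer's own snippet is not shown here, so the following is only a minimal sketch of that suggestion. The function name `convert_hoisted` and its signature are made up for illustration; the loop body is the one from the question.

```cpp
#include <cassert>

// Hypothetical helper (name and signature are illustrative, not from the thread).
// Same loop as in the question, with W*H computed once before the loop.
void convert_hoisted(const unsigned short* source, float* destination,
                     int W, int H, float factor)
{
    assert(W >= 0 && H >= 0);
    const int count = W * H;  // evaluated once, outside the loop condition
    for (int i = 0; i < count; ++i) {
        int value = source[i];            // ushort -> int
        destination[i] = value * factor;  // int * float -> float
    }
}
```

An optimizing compiler will often hoist the product on its own, so it is worth measuring before and after rather than assuming a gain.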
I believe I have the best answer. My results are much faster than Mystical's. They only require SSE2 but take advantage of SSE3, SSE4, AVX, and even AVX2 if available. You don't have to change any code. You only have to recompile.
I ran over three sizes: 8008, 64000, and 2560*1920 = 4915200. I tried several different variations and list the most important ones below. The function vectorize8_unroll2 is Mystical's function. I made an improved version of his called vectorize8_unroll2_parallel. The functions vec16_loop_unroll2_fix and vec16_loop_unroll2_parallel_fix are my functions, which I believe are better than Mystical's. These functions will automatically use AVX if you compile with AVX, but work fine on SSE4 and even SSE2.
Additionally, you wrote "W*H is always divisible by 8, most of the time W and H are divisible by 8". So we can't assume W*H is divisible by 16 in all cases. Mystical's function vectorize8_unroll2 has a bug when size is not a multiple of 16 (try size=8008 in his code and you will see what I mean). My code has no such bug.
I'm using Agner Fog's vectorclass for the vectorization. It's not a lib or dll file, just a few header files. I use OpenMP for the parallelization. Here are some of the results:
Edit: I added the results on a system with AVX using GCC at the end of this answer.
Below is the code. The code only looks long because I do lots of cross checks and test many variations. Download the vectorclass at http://www.agner.org/optimize/#vectorclass . Copy the header files (vectorclass.h, instrset.h, vectorf128.h, vectorf256.h, vectorf256e.h, vectori128.h, vectori256.h, vectori256e.h) into the directory you compile from. Add /D__SSE4_2__ under C++/CommandLine. Compile in release mode. If you have a CPU with AVX, use /arch:AVX instead. Add OpenMP support under C++ properties/languages.
In the code below, the function vec16_loop_unroll2_parallel requires the array size to be a multiple of 32. You can change the array size to be a multiple of 32 (that's what size2 refers to) or, if that's not possible, you can just use the function vec16_loop_unroll2_parallel_fix, which has no such restriction. It's just as fast anyway.
Results using GCC on a system with AVX
GCC automatically vectorizes the loop (Visual Studio fails due to the short but works if you try int). You gain very little with hand-written vectorization code. However, using multiple threads can help, depending on the array size. For the small array size 8008, OpenMP gives a worse result. For the larger array size 128000, OpenMP gives much better results. For the largest array size 4915200, it's entirely memory bound and OpenMP does not help.
This is not a real answer, so don't take it as one, but I'm wondering how the code would behave with a 256 KB look-up table (basically a 'short to float' table with 65536 entries).
A Core i7 has about 8 megabytes of cache, I believe, so the look-up table would fit in the data cache.
I really wonder how that would impact the performance :)
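For anyone who wants to try it, here is a minimal sketch of the table idea. The names `build_table` and `convert_with_table` are hypothetical, and it assumes `unsigned short` is 16 bits wide, as it is on x86. Whether it beats the multiply is exactly the open question, so treat it as something to benchmark, not a recommendation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: precomputes value * factor for every possible 16-bit input.
// 65536 floats * 4 bytes = 256 KB.
std::vector<float> build_table(float factor)
{
    std::vector<float> table(65536);
    for (std::size_t v = 0; v < table.size(); ++v) {
        table[v] = static_cast<float>(v) * factor;
    }
    return table;
}

// Hypothetical helper: replaces the convert-and-multiply with one table read per pixel.
void convert_with_table(const unsigned short* source, float* destination,
                        std::size_t count, const std::vector<float>& table)
{
    assert(table.size() == 65536);
    for (std::size_t i = 0; i < count; ++i) {
        destination[i] = table[source[i]];
    }
}
```

The table has to be rebuilt whenever the factor changes, which adds 65536 multiplications per change.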
You can also use OpenMP to put every core of your CPU to work, and it is simple. Just do the following:
Here is the result based on the previous program; just add it like this:
And then here is the result:
The result shows a 100% improvement with OpenMP. Visual C++ supports OpenMP too.
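A minimal sketch of what an OpenMP version of the question's loop could look like. The function name `convert_parallel` is hypothetical and this is not the code that produced the result above. It assumes OpenMP is enabled at compile time (/openmp in Visual C++, -fopenmp in GCC); without that flag the pragma is ignored and the loop runs on one thread.

```cpp
#include <cassert>

// Hypothetical helper (illustrative name and signature).
// Each thread converts its own slice of the buffer; iterations are independent.
void convert_parallel(const unsigned short* source, float* destination,
                      int count, float factor)
{
    assert(count >= 0);
    // The loop index is a signed int because Visual C++ implements OpenMP 2.0,
    // which requires a signed integer loop variable.
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        destination[i] = static_cast<float>(source[i]) * factor;
    }
}
```

For small buffers the cost of starting the threads can outweigh the gain, so measure with your actual W*H.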
Using SSE intrinsics on my machine [Quad Core Athlon, 3.3GHz, 16GB of RAM] with g++ -O2 optimisation [1] gives about a 2.5-3x speed-up. I also wrote a function to do the same thing in inline assembler, but it's not noticeably faster (again, this applies on my machine; feel free to run it on other machines).
I tried a variety of sizes of H * W, and they all give approximately the same results.
[1] Using g++ -O3 gives the same time for all four functions, as apparently -O3 enables "automatically vectorise code". So the whole thing was a bit of a waste of time, assuming your compiler supports similar auto-vectorisation functionality.
Results
Code
Here's a basic SSE4.1 implementation:
This assumes:
source and destination are both aligned to 16 bytes.
W*H is a multiple of 8.
It's possible to do better by further unrolling this loop. (see below)
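The idea here is as follows: load 8 shorts at a time, zero-extend them to 32-bit integers, convert those to floats, multiply by the factor, and store the results into destination.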
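A rough sketch of that idea follows. It is not the answer's own code. The function name `convert_sse2` is hypothetical, and it swaps in two things: SSE2 unpacking against a zero register in place of the SSE4.1 zero-extension instruction, and unaligned loads and stores so that it makes no alignment assumption. It also adds a scalar tail so the count does not have to be a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (illustrative name and signature).
void convert_sse2(const unsigned short* source, float* destination,
                  std::size_t count, float factor)
{
    assert(source != nullptr && destination != nullptr);
    const __m128 scale = _mm_set1_ps(factor);
    const __m128i zero = _mm_setzero_si128();

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        // Load 8 unsigned 16-bit values (unaligned load).
        __m128i packed = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(source + i));
        // Interleave with zeros: each half becomes 4 zero-extended 32-bit integers.
        __m128i lo = _mm_unpacklo_epi16(packed, zero);
        __m128i hi = _mm_unpackhi_epi16(packed, zero);
        // Convert to float, multiply by the factor, and store.
        _mm_storeu_ps(destination + i,
                      _mm_mul_ps(_mm_cvtepi32_ps(lo), scale));
        _mm_storeu_ps(destination + i + 4,
                      _mm_mul_ps(_mm_cvtepi32_ps(hi), scale));
    }
    // Scalar tail for any elements left over when count is not a multiple of 8.
    for (; i < count; ++i) {
        destination[i] = static_cast<float>(source[i]) * factor;
    }
}
```

If both buffers really are 16-byte aligned, the aligned load and store intrinsics can be used instead. On this generation of CPU that may or may not matter, so benchmark both.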
EDIT:
It's been a while since I've done this type of optimization, so I went ahead and unrolled the loops.
Core i7 920 @ 3.5 GHz
Visual Studio 2012 - Release x64:
Further unrolling resulted in diminishing returns.
Here's the test code: