I'm writing a C# class to perform 2D separable convolution using integers, to obtain better performance than the double counterpart. The problem is that I don't obtain a real performance gain.
This is the X filter code (it is valid both for int and double cases):
foreach (pixel)
{
    int value = 0;
    for (int k = 0; k < filterOffsetsX.Length; k++)
    {
        value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k]; // index is relative to the current pixel position
    }
    tempImage[index] = value;
}
In the integer case, "value", "InputImage" and "tempImage" are of types int, Image<byte> and Image<int>.
In the double case, "value", "InputImage" and "tempImage" are of types double, Image<double> and Image<double>.
(filterValuesX is int[] in both cases.)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)
My goal is to achieve fast performance thanks to int += (byte * int) vs double += (double * int).
The following times are the mean of 200 repetitions.
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)
The performance gain is minimal. Can this be caused by pipeline stalls and suboptimal code?
EDIT: simplified the code by deleting unimportant vars.
EDIT2: I don't think I have a cache-miss-related problem, because "index" iterates through adjacent memory cells (row after row). Moreover, "filterOffsetsX" contains only small offsets relative to pixels on the same row, at a maximum distance of filter size / 2. The problem could be present in the second separable filter (the Y-filter), but the times are not so different.
If the times you measured are accurate, then the runtime of your filtering algorithm seems to grow with the cube of the filter size. What kind of filter is that? Maybe you can reduce the number of multiplications needed (e.g. if you're using a separable filter kernel?).
Otherwise, if you need raw performance, you might consider using a library like the Intel Integrated Performance Primitives (IPP) - it contains highly optimized functions for things like this that use CPU SIMD instructions. They're usually a lot faster than hand-written code in C# or C++.
I benchmarked this using Visual C++, because that way I can be sure that I'm timing arithmetic operations and not much else.
Results (each operation is performed 600 million times):
CPU is an Intel Core i7, Turbo Boosted to 2.53 GHz.
Benchmark code:
I did a default optimized build using Visual C++ 2010 32-bit.
Every call to profile, add, mult, and divide (inside the loops) got inlined. Function calls were still generated to profile, but since 60 million operations get done for each call, I think the function call overhead is unimportant.

Even with volatile thrown in, the Visual C++ optimizing compiler is SMART. I originally used small integers as the right-hand operand, and the compiler happily used lea and add instructions to do integer multiply. You may get more of a boost from calling out to highly optimized C++ code than the common wisdom suggests, simply because the C++ optimizer does a much better job than any JIT.

Originally I had the initialization of var outside the loop, and that made the floating-point multiply code run miserably slow because of the constant overflows. FPU handling of NaNs is slow, something else to keep in mind when writing high-performance number-crunching routines.

The dependencies are also set up in such a way as to prevent pipelining. If you want to see the effects of pipelining, say so in a comment, and I'll revise the testbench to operate on multiple variables instead of just one.
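The original testbench is not shown here, but the volatile trick described above might look roughly like the following hypothetical sketch (the name mult is borrowed from the answer; the body is my own reconstruction):

```cpp
#include <cassert>

// Hypothetical reconstruction of the testbench pattern described above.
// The volatile qualifier forces the compiler to perform every load and
// store on 'var', so the multiply loop cannot be optimized away, yet the
// multiplies themselves still compile to ordinary machine instructions.
template <typename T>
T mult(T seed, long iterations)
{
    volatile T var = seed;       // re-seeded each call: keeps doubles from
                                 // overflowing toward infinity/NaN, which is slow
    for (long i = 0; i < iterations; ++i)
        var = var * seed;        // each multiply depends on the previous result
    return var;
}
```

Because each iteration reads the result of the previous one, the multiplies cannot overlap in the pipeline, which matches the answer's remark about dependencies preventing pipelining.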
Disassembly of i32 multiply:
And of f64 multiply:
At least it is not fair to compare int (DWORD, 4 bytes) and double (QWORD, 8 bytes) on a 32-bit system. Compare int to float, or long to double, to get fair results. double has increased precision; you must pay for it.

PS: for me it smells like micro (and premature) optimization, and that smell is not good.
Edit: OK, good point. It is not correct to compare long to double, but comparing int and double on a 32-bit CPU is still not correct, even if both have intrinsic instructions. This is not magic; x86 is a fat CISC architecture, but double is still not processed as a single step internally.
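The width mismatch being discussed is easy to verify. The checks below encode an assumption about typical x86/x64 targets (the C++ standard only mandates minimum sizes):

```cpp
#include <cassert>

// On common 32- and 64-bit platforms, int is a 4-byte DWORD while double
// is an 8-byte QWORD; float matches int's width. These are the "fair"
// pairings the answer suggests comparing.
static_assert(sizeof(int) == 4, "int is expected to be a DWORD");
static_assert(sizeof(float) == 4, "float matches int in width");
static_assert(sizeof(double) == 8, "double is a QWORD");
```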
It seems like you are saying you are only running that inner loop 5000 times, even in your longest case. The FPU, last I checked (admittedly a long time ago), only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.
Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.
digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.
If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:
Did you try looking at the disassembled code? In high-level languages I'm pretty much trusting the compiler to optimize my code. For example, for (i = 0; i < imageSize; i++) might be faster than foreach. Also, arithmetic operations might get optimized by the compiler anyway... When you need to optimize something, you either optimize the whole "black box" (and maybe reinvent the algorithm used in that loop), or you first take a look at the disassembled code and see what's wrong with it.

Your algorithm seems to access large regions of memory in a very non-sequential pattern. It's probably generating tons of cache misses. The bottleneck is probably memory access, not arithmetic. Using ints should make this slightly faster because ints are 32 bits while doubles are 64 bits, meaning cache will be used slightly more efficiently. If almost every loop iteration involves a cache miss, though, you're basically out of luck unless you can make some algorithmic or data-structure layout changes to improve the locality of reference.
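To make the locality point concrete, here is a small stand-alone sketch (function names are my own): both functions compute the same sum over the same buffer, but the row-major version walks consecutive addresses while the column-major version strides a full row per access, so on large images the latter generates far more cache misses:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum all pixels in row-major order: consecutive memory addresses,
// so every byte of each fetched cache line is used.
long long sumRowMajor(const std::vector<uint8_t>& img, int width, int height)
{
    long long s = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            s += img[y * width + x];
    return s;
}

// Same sum in column-major order: each access jumps 'width' bytes ahead,
// touching a different cache line almost every iteration on large images.
long long sumColMajor(const std::vector<uint8_t>& img, int width, int height)
{
    long long s = 0;
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            s += img[y * width + x];
    return s;
}
```

Both functions return identical results; only the traversal order (and hence the cache behavior) differs.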
BTW, have you considered using an FFT for convolution? That would put you in a completely different big-O class.