I know that you should only optimize things when it is deemed necessary. But, if it is deemed necessary, what are your favorite low level (as opposed to algorithmic level) optimization tricks.
For example: loop unrolling.
I know that you should only optimize things when it is deemed necessary. But, if it is deemed necessary, what are your favorite low level (as opposed to algorithmic level) optimization tricks.
For example: loop unrolling.
++i
can be faster thani++
, because it avoids creating a temporary.Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
I was amazed at the speedup I got by replacing a for loop adding numbers together in structs:
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?
Liberal use of __restrict to eliminate load-hit-store stalls.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.