I have a C program that has n multiplications (single multiplication with n iterations) and I found another logic that has n/2 iterations of (1 multiplication + 2 additions). I know about the complexity that both are of O(n). but in terms of CPU cycles. which is faster ?
相关问题
- Multiple sockets for clients to connect to
- What is the best way to do a search in a large fil
- Faster loop: foreach vs some (performance of jsper
- glDrawElements only draws half a quad
- Why wrapping a function into a lambda potentially
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication might be very cheap, on some newish Intel processors it's 3 clock cycles. Additions are 1 cycle on these same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:
For Sandy bridge the rec. throughput for an
add r, r/i
(for further notice r=register, i=immediate, m=memory) is 0.33 while the latency is 1.An
imul r, r
has a latency of 3 and a rec. throughput of 1.So as you see it completely depends on your specific algorithm - if you can just replace one imul with two independent adds this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand if your adds add a problematic dependency one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).