可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Good Day,

Suppose that you have a simple for loop like below...

for(int i=0;i<10;i++)
{
    //statement 1
    //statement 2
}

Assume that statement 1 and statement 2 were O(1). Besides the small overhead of "starting" another loop, would breaking down that for loop into two (not nested, but sequential) loops be as equally fast? For example...

for(int i=0;i<10;i++)
{
    //statement 1
}
for(int i=0;i<10;i++)
{
    //statement 2
}

Why I ask such a silly question is that I have a Collision Detection System(CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS system so I can simply call

cds.update(objectlist);

instead of having to break my cds system up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it, what I really need to know is if I take a huge performance hit for looping through all my objects again.

回答1:

It depends on your application.

Possible Drawbacks (of splitting):

your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop

Possible Gains (of splitting):

your loop contains many variables, splitting helps reducing register/stack pressure and the optimizer turns it into better machine code
the functions you use trash the L1 instruction cache so the cache is loaded on each iteration, while by splitting you manage into loading it once (only) at the first iteration of each loop

These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to take an educated/a wild guess when we know neither.

In doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed. Measure the time spent.

回答2:

In terms of algorithmic complexity splitting the loops makes no difference.

In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.

回答3:

With two loops you will be paying for:

increased generated code size
2x as many branch predicts
depending what the data layout of statement 1 and 2 are you could be reloading data into cache.

The last point could have a huge impact in either direction. You should measure as with any perf optimization.

回答4:

As far as the big-o complexity is concerned, this doesn't make a difference if 1 loop is O(n), then so is the 2 loop solution.
As far as micro-optimisation, it is hard to say. The cost of a loop is rather small, we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.

回答5:

As noted, the complexity remains.

But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:

Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes

(note: over all of them, there's the Damocles sword of misprediction; all are wikipedizable and googlable)

Especially the last factor makes it sometimes impossible to compile the one true code for code whose performance relies on specific cache sizes. Some applications will run faster on CPU with huge caches, while running slower on small caches, and for some other applications it will be the opposite.

Solutions:

Let your compiler do the job of loop transformation. Modern g++'s are quite good in that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.

It is always a good idea to endeavor for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, either microscopic or macroscopic, measure real world runs, then and only then optimize.

Literature: * Agner Fog's Guides * Intel's Guides

回答6:

You're correct in noting that there will be some performance overhead by creating a second loop. Therefore, it cannot be "equally fast"; as this overhead, while small, is still overhead.

I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.

Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.