How can a program's size increase the rate of cache misses?

Published 2019-07-25 01:06

Question:

Wikipedia has the following statement in its Loop Unrolling article:

Increased program code size, which can be undesirable, particularly for embedded applications. Can also cause an increase in instruction cache misses, which may adversely affect performance.

Why is this?

Also, wouldn't a larger code size caused by large amounts of dead code leave the cache miss rate unchanged, since the dead code is never executed?

Answer 1:

Code is typically read into caches in whole cache lines, which might be 64, 128 or 256 bytes. If you have a 256 byte cache line and three quarters of that is dead code, then you are not using your cache memory very well. On the other hand, if you have a megabyte of completely unused code, that doesn't affect cache efficiency at all.

Some compilers will use heuristics or hints by the software developer to find out which code is probably very rarely used, and arrange code so that one cache line will be completely filled with used code or completely filled with unused code.

And unlike dead code, used code that is inflated by loop unrolling will increase cache misses.



Answer 2:

The Wikipedia article on loop unrolling lists several possible downsides of this technique, two of which are related to code size:

  • Increased program code size, which can be undesirable, particularly for embedded applications. Can also cause an increase in instruction cache misses, which may adversely affect performance.

  • If the code in the body of the loop involves function calls, it may not be possible to combine unrolling with inlining, since the increase in code size might be excessive. Thus there can be a trade-off between the two optimizations.

Loop unrolling increases the static code size because it duplicates part of the code that is in a loop. The hope is that it will decrease the dynamic instruction count because fewer iterations of the loop will need to be executed.

For a small loop body the increased number of instructions in the loop body probably won't have a negative impact on the instruction cache hit rate. But as the loop body gets larger the increased code size can cause other useful lines in the instruction cache to be evicted, and the overall hit rate might be reduced. This is primarily because of the larger amount of code that is executed.

Loop unrolling becomes particularly complex when the loop body contains function calls and the compiler has to decide whether to inline them, or when the loop is part of a function that might itself be inlined elsewhere. Both inlining and loop unrolling have the potential to decrease the dynamic instruction count, but they also increase static code size. So the compiler typically uses heuristics to choose how to combine these optimizations, since the problem can't be solved optimally (at least not in a reasonable amount of time).

As mentioned in one of the other answers, alignment issues might also cause more instruction cache misses. However, the primary reason loop unrolling can cause more cache misses is that it increases the amount of code actually executed in the loop body (as well as the additional prologue and epilogue code).



Answer 3:

Unrolling loops makes them faster, but at the cost of extra instruction cache misses, and while an instruction cache miss is being serviced, the processor can't schedule any useful code.

So why do we unroll anyway (if we have the memory for it)?

Because if unrolling 4 times makes the loop run 25% faster over 1000 iterations, it's still a net win even after paying the cost of one extra cache miss after the loop.

So you have to weigh the gain from the extra code against the extra cost of the cache misses it causes.

Extra code also causes some extra TLB misses, but the argument still holds.