Alignment of data members and member functions for

Posted 2019-06-02 11:49

Question:

Is it true that aligning the data members of a struct/class no longer yields the benefits it used to, especially on Nehalem, because of hardware improvements? If so, does alignment still always improve performance, only by a much smaller margin than on past CPUs?

Does alignment of member variables extend to member functions? I believe I once read (possibly in the Wikibooks "C++ performance" book) that there are rules for "packing" member functions into various "units" (i.e. source files) for optimum loading into the instruction cache. (If I have got my terminology wrong here, please correct me.)

Answer 1:

Processors are still much faster than RAM can deliver data, so they still need caches. Caches still consist of fixed-size cache lines. Also, main memory is managed in pages, and page addresses are translated through a translation lookaside buffer (TLB). The TLB, again, is a fixed-size cache.

Which means that both spatial and temporal locality matter a lot (i.e. how you pack stuff, and how you access it). Packing structures well (sorted by padding/alignment requirements) as opposed to packing them in some haphazard order usually results in smaller structure sizes.
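To make the packing point concrete, here is a minimal sketch comparing the same four members in a haphazard order and sorted by alignment requirement. The struct names, member names and the 64-byte cache line are assumptions made for illustration, not something from the question.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, which is typical for current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// Members in a haphazard order: the compiler has to insert padding
// in front of each member that needs wider alignment than its predecessor.
struct Haphazard {
    std::uint8_t  flag;
    std::uint64_t id;
    std::uint8_t  kind;
    std::uint32_t count;
};

// The same members, sorted from largest to smallest alignment requirement.
struct Sorted {
    std::uint64_t id;
    std::uint32_t count;
    std::uint8_t  flag;
    std::uint8_t  kind;
};

// How many whole structures of the given size fit into one cache line.
constexpr std::size_t per_cache_line(std::size_t struct_size) {
    return kCacheLine / struct_size;
}

static_assert(sizeof(Sorted) <= sizeof(Haphazard),
              "sorting members by alignment should not grow the struct");
```

On a typical x86-64 ABI, sizeof(Haphazard) is 24 and sizeof(Sorted) is 16, so four sorted structures fit into a 64-byte line instead of two. Exact sizes depend on the platform ABI.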

Smaller structure sizes mean, if you have loads of data:

  • more structures fit into one cache line (cache miss = 50-200 cycles)
  • fewer pages are needed (page fault = 10-20 million CPU cycles)
  • fewer TLB entries are needed, fewer TLB misses (TLB miss = 50-500 cycles)

Going linearly over a few gigabytes of tightly packed SoA data can be 3 orders of magnitude faster (or 8-10 orders of magnitude, if page faults are involved) than doing the same thing in a naive way with bad layout/packing.
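For readers unfamiliar with the term, SoA means "structure of arrays", as opposed to an "array of structures" (AoS). The following is only a sketch of the two layouts; the particle type and the function names are hypothetical, and it makes no claim about the actual speedup on any given machine.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: summing `mass` also pulls every particle's
// position and velocity through the cache, although they are never used.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

// Structure of arrays: each field is stored contiguously, so a pass
// over `mass` touches only the cache lines that hold masses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
};

inline double total_mass_aos(const std::vector<ParticleAoS>& particles) {
    double sum = 0.0;
    for (const ParticleAoS& p : particles) sum += p.mass;
    return sum;
}

inline double total_mass_soa(const ParticlesSoA& particles) {
    double sum = 0.0;
    for (float m : particles.mass) sum += m;
    return sum;
}
```

Both functions compute the same result; the difference is that the AoS loop reads 28 bytes per particle to use 4 of them, while the SoA loop reads only the 4 bytes it needs.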

Whether or not you hand-align individual 2-byte or 4-byte values (say, a typical short or int) to 2 or 4 bytes makes very little difference on recent Intel CPUs; it is hardly noticeable. It may seem tempting to "optimize" on that, but I strongly advise against doing so.
This is usually something best left to the compiler to figure out. If for no other reason, then because the gains are marginal at best, while some other processor architectures will raise an exception if you get it wrong. If you try to be too smart, you'll suddenly have inexplicable crashes once you compile for some other architecture. When that happens, you'll feel sorry.
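As an illustration of "getting it wrong": a common way to end up with misaligned accesses is casting a byte pointer at an arbitrary offset to a wider type. A portable sketch that avoids this uses std::memcpy, which is defined for any alignment. The helper names load_u32 and store_u32 are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from an arbitrary, possibly misaligned, byte offset.
// std::memcpy is well-defined for any alignment. Casting `bytes + offset`
// to `const std::uint32_t*` and dereferencing it is not, and may trap on
// strict-alignment architectures.
inline std::uint32_t load_u32(const unsigned char* bytes, std::size_t offset) {
    std::uint32_t value;
    std::memcpy(&value, bytes + offset, sizeof value);
    return value;
}

// Writes a 32-bit value to an arbitrary, possibly misaligned, byte offset.
inline void store_u32(unsigned char* bytes, std::size_t offset, std::uint32_t value) {
    std::memcpy(bytes + offset, &value, sizeof value);
}
```

Mainstream compilers generally turn these small fixed-size copies into plain loads and stores where the target allows it, so leaving the decision to the compiler normally costs nothing.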

Of course, if you don't have at least several dozen megabytes of data to process, you need not care at all.



Answer 2:

I think the best way to answer this question is: aligning data to suit the processor will never hurt, but the penalty for misaligned data is more noticeable on some processors than on others.

Aligning functions into cache-line units seems a bit of a red herring to me. For small functions, what you really want is inlining if at all possible. If the code can't be inlined, then it's probably larger than a cache line anyway. [Unless it's a virtual function, of course]. I don't think this has ever been a huge factor though: either code is called often, and thus normally in the cache, or it's not called very often, and so not very often in the cache. I'm sure it's possible to come up with some code where calling one function, func1(), will also drag func2() into the cache, so if you always call func1() and func2() in short succession, it would have some benefit. But it's really not much of a benefit unless you have a lot of pairs or groups of functions that are called close together. [By the way, I don't think the compiler is guaranteed to place your function code in any particular order, no matter which order you place it in the source file].

Cache alignment is a slightly different matter, since cache lines can still have a HUGE effect if you get it right vs. getting it wrong. This is more important for multithreading than for general "loading data". The key here is to avoid sharing data in the same cache line between processors. In a project I worked on some 10 or so years ago, a benchmark had a function that used an array of two integers to count the number of iterations each thread did. When that got split into two separate cache lines, the benchmark improved from 0.6x the single-processor speed to 1.98x. The same effect will happen on modern CPUs, even if they are much faster. The numbers may not be exactly the same, but it will be a large slowdown (and the more processors sharing data, the bigger the effect, so a quad-core system would be worse than a dual-core one, etc). This is because every time a processor updates something in a cache line, all other processors that have read that cache line must reload it from the processor that updated it [or from memory in the old days].
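The benchmark code itself is not shown here, so the following is only a sketch of the general fix, not the original program. It assumes 64-byte cache lines; PaddedCounter and count_in_two_threads are hypothetical names. Each per-thread counter is aligned to a full cache line so that the two threads never write to the same line.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines, which is typical for current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// alignas pads each counter out to a full cache line, so adjacent
// array elements can never end up in the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Two threads each increment their own counter `iterations` times.
// Returns the combined count.
inline long count_in_two_threads(long iterations) {
    PaddedCounter counters[2];
    auto work = [&counters, iterations](int idx) {
        for (long i = 0; i < iterations; ++i) {
            counters[idx].value.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t0(work, 0);
    std::thread t1(work, 1);
    t0.join();
    t1.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

With a plain `long counters[2]` instead, both counters would normally sit in the same cache line and every increment would bounce that line between the two cores. On some toolchains this needs to be built with -pthread.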