I am trying to get a good grip on data oriented design and how to program best with the cache in mind. There's basically two scenarios that I cannot quite decide which is better and why - is it better to have a vector of objects, or several vectors with the objects atomic data?
A) Vector of objects example
struct A
{
GLsizei mIndices;
GLuint mVBO;
GLuint mIndexBuffer;
GLuint mVAO;
size_t vertexDataSize;
size_t normalDataSize;
};
std::vector<A> gMeshes;
for_each(gMeshes as mesh)
{
glBindVertexArray(mesh.mVAO);
glDrawElements(GL_TRIANGLES, mesh.mIndices, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);
....
}
B) Vectors with the atomic data
std::vector<GLsizei> gIndices;
std::vector<GLuint> gVBOs;
std::vector<GLuint> gIndexBuffers;
std::vector<GLuint> gVAOs;
std::vector<size_t> gVertexDataSizes;
std::vector<size_t> gNormalDataSizes;
size_t numMeshes = ...;
for (index = 0; index++; index < numMeshes)
{
glBindVertexArray(gVAOs[index]);
glDrawElements(GL_TRIANGLES, gIndices[index], GL_UNSIGNED_INT, 0);
glBindVertexArray(0);
....
}
Which one is more memory efficient and cache friendly resulting in less cache misses and better performance, and why?
With some variation according to which level of cache you're talking about, cache works as follows:
So naively the questions to ask are:
So, I would sort of expect B to be faster for this code. However:
struct
. So do that. Presumably in fact it is not the only access to the data in your program, and the other accesses might affect performance in two ways: the time they actually take, and whether they populate the cache with the data you need.I understand that this is partly opinion-based, and also that it could be a case of premature optimization, but your first option definitely has the best aesthetics. It's one vector versus six - no contest in my eyes.
For cache performance, it ought to be better. That is because the alternative requires access to two different vectors, which splits memory access every single time you render a mesh.
With the structure approach, the mesh is essentially a self-contained object and correctly implies no relation to other meshes. When drawing, you only access that mesh, and when rendering all meshes, you do one at a time in a cache-friendly manner. Yes, you will eat cache more quickly because your vector elements are larger, but you won't be contesting it.
You may also find other benefits later on from using this representation. ie if you want to store additional data about a mesh. Adding extra data in more vectors will quickly clutter your code and increase the risk of making silly errors, whereas it's trivial to make changes to the structure.
I recommend profiling with either perf or oprofile and posting your results back here (assuming you are running linux), including the number of elements you iterated across, number of iterations in total, and the hardware you tested on.
If I had to guess (and this is only a guess), I'd suspect that the first approach might be faster due to the locality of data within each structure, and hopefully the OS/hardware can prefetch additional elements for you. But again, this will depend on cache size, cache line size, and other aspects.
Defining "better" is interesting too. Are you looking for overall time to process N elements, low variance in each sample, minimal cache misses (which will be influenced by other processes running on your system), etc.
Don't forget that with STL vectors, you are also at the mercy of the allocator... e.g. it can decide at any time to reallocate the array, which will invalidate your cache. Another factor to try to isolate if you can!
Depends on your access patterns. Your first version is AoS (array of structures), second is SoA (structure of arrays).
SoA tends to use less memory (unless you store so few elements that the overhead of the arrays is actually non-trivial) if there's any kind of structure padding that you'd normally get in the AoS representation. It also tends to be a much bigger PITA to code against since you have to maintain/sync parallel arrays.
AoS tends to excel for random-access. As an example, for simplicity let's say each element fits into a cache line and is properly aligned (64 byte size and alignment, e.g.). In that case, if you are randomly accessing an
nth
element, you get all the relevant data for the element in a single cache line. If you used an SoA and dispersed those fields across separate arrays, you'd have to load memory into multiple cache lines just to load the data for that one element. And because we're accessing the data in a random pattern, we don't benefit from spatial locality much at all since the next element we're going to be accessing could be somewhere completely else in memory.However, SoA tends to excel for sequential access mainly because there's often less data to load into the CPU cache in the first place for the entire sequential loop because it excludes structure padding and cold fields. By cold fields, I mean fields you don't need to access in a particular sequential loop. For example, a physics system might not care about particle fields involved with how the particle looks to the user, like color and a sprite handle. That's irrelevant data. It only cares about particle positions. The SoA allows you to avoid loading that irrelevant data into cache lines. It allows you to load as much relevant data into a cache line at once so you end up with fewer compulsory cache misses (as well as page faults for large enough data) with the SoA.
That's also only covering memory access patterns. With SoA reps, you also tend to be able to write more efficient and simpler SIMD instructions. But again it's mainly suited for sequential access.
You can also mix the two concepts. You might use an AoS for hot fields frequently accessed together in random-access patterns, then hoist out the cold fields and store them in parallel.