I am in need of a 2bit array, I am not concerned with saving memory at all, but I am concerned with minimizing cache misses and maximizing cache efficiency. Using an array of bools will use 4 times more memory, which means for every usable chunk of data in the cache, there will be 3 that are not used. So technically, I can get 3 times better cache consistency if I use bitfields.
The plan is to implement it as an array of bytes, divided into 4 equal bitfields, and use the div function to be able to get the integral quotient and remainder, possibly in a single clock, and use those to access the right index and right bitfield.
The array I needs is about 10000 elements long, so it will make for a significantly denser packed data, using 2 actual bits will allow for the entire array to fit in L1 cache, while using a byte array this will not be possible.
So my question is whether someone can tell me if this is a good idea in a performance oriented task, so I know if it is worth to go forth and implement a 2bit array? And surely, the best way to know is profiling, but any information in advance may be useful and will be appreciated.
With 10000 elements, on a modern processor, it should fit nicely in memory as bytes (10KB), so I wouldn't worry too much about it, unless you want this to run on some very tiny microprocessor with a cache that is much smaller than the typical 16-32KB L1 caches that modern CPU's have.
Of course, you may well want to TEST the performance with different solutions, if you think this is an important part of your code from a performance perspective [as measured from your profiling that you've already done before you start optimising, right?].
It's not clear to me that this will result in a performance gain. Accessing each field will require several instructions (
(data[i / 4] >> 2 * (i % 4)) & 0x03
), and a lot of modern processors have an L3 cache which would hold the entire array with one byte per entry. Whether the extra cost in execution time will be greater or less than the difference in caching is hard to say; you'll have to profile to know exactly.If you can organize your algorithms to work a byte (or even a word) at a time, the cost of access may be much less. Iterating over the entire array, for example:
could make a significant difference. Most compilers will be able to keep
w1
andw2
in registers, meaning you'll only have 1/4 as many memory accesses. Packing withunsigned int` would probably be even faster.