Performance advantages of powers-of-2 sized data?

Posted 2020-02-17 03:44

Question:

If I have a game which has a 3D world, and the world is quite big, so it needs to be split into chunks, is there a major (or any) performance advantage to having 128-byte chunks over, say, 150-byte chunks? Obviously, the objects in the chunks are still a whole number of bytes in size.

i.e. is chunks[128][128][128] faster than chunks[150][150][150] or chunks[112][112][112]? Are there any other side effects, such as excessive RAM wastage? Are there any other factors that should be taken into consideration?

I just see that it's a convention to store everything in variables and arrays whose sizes are powers of 2, but I'm not sure whether there's any merit to it, or whether it would be better to use more human-friendly numbers like 100 or 150.

Answer 1:

The other answers are indeed correct that power-of-two-sized data can benefit from using shifts instead of multiplies.

However, there is a dark side to power-of-two-sized data. And it can hit you when you least expect it.

See these two questions and their answers:

  • Matrix multiplication: Small difference in matrix size, large difference in timings
  • Why are elementwise additions much faster in separate loops than in a combined loop?

When your datasets are power-of-two sized, they are more likely to be super-aligned in memory (meaning their addresses are likely to be congruent modulo a large power of two).

While this may seem desirable, it can lead to:

  • Conflict Cache Misses
  • False Aliasing Stalls (mentioned in the second link above)

If you read the two questions linked above, you can see that this alignment can cause a slow-down of more than 3x, which will likely far outweigh any benefit you get from using shifts instead of multiplies.


So as with all performance questions, you need to measure, measure, measure... and be prepared for anything to happen.

You mention that you are representing a 3D space; that is exactly the kind of situation that tends to produce power-of-two-strided memory accesses, which can lead to these slow-downs.
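
As an illustration only (the array size, the 1024/1025 row lengths, the loop counts, and the function name are assumptions of mine, not taken from the linked questions), here is a minimal sketch of the kind of strided access that shows the effect: walking down one column of a 2D array whose row length is a large power of two, so successive accesses keep mapping to the same cache sets.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Walk down one column of a ROWS x cols matrix of floats.  Every access jumps
// 'cols' floats ahead in memory, so with cols = 1024 (a power of two) the
// touched addresses keep mapping to the same cache sets, while cols = 1025
// spreads them across many different sets.
static double column_walk(std::size_t cols) {
    const std::size_t ROWS = 4096, PASSES = 10000;
    std::vector<float> a(ROWS * cols, 1.0f);
    volatile float sink = 0.0f;                  // keeps the loads from being optimised away
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t pass = 0; pass < PASSES; ++pass)
        for (std::size_t r = 0; r < ROWS; ++r)
            sink = sink + a[r * cols];           // stride of 'cols' floats
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::printf("power-of-two stride (1024): %.3f s\n", column_walk(1024));
    std::printf("odd stride (1025):          %.3f s\n", column_walk(1025));
}

The exact slow-down (if any) depends on the cache sizes and associativity of the machine, which is exactly why measuring on the target hardware matters.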



Answer 2:

It's not exactly "faster"; rather, it utilises the available memory better, since the hardware and the operating system manage memory in units whose size is most likely a power of two. Allocating something whose size is not a multiple of such a unit will usually waste some memory because of alignment requirements.

If you dig deeper into allocators and OS memory managers, you will see that they manage everything in power-of-two sizes. An OS usually manages a process's memory in terms of pages, and a page is usually 4096 bytes nowadays. So if you want to allocate a piece that is 4000 bytes, the OS will still allocate 4096 bytes and the remaining 96 bytes will be wasted.
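
A small sketch of that rounding, assuming a POSIX system (sysconf(_SC_PAGESIZE) is the standard POSIX way to query the page size); note that a user-space allocator such as malloc subdivides pages for small requests, so full-page rounding applies most directly to memory requested straight from the OS:

#include <cstdio>
#include <unistd.h>   // sysconf(_SC_PAGESIZE) is POSIX

int main() {
    long page    = sysconf(_SC_PAGESIZE);        // typically 4096 bytes
    long request = 4000;                         // the example size from above
    long pages   = (request + page - 1) / page;  // round up to whole pages
    long granted = pages * page;
    std::printf("page %ld bytes, requested %ld, backed by %ld, wasted %ld\n",
                page, request, granted, granted - request);
}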



Answer 3:

If you access the data in the following way:

int chunks[150][150][150];
chunks[x][y][z] = 123;

then the processor must do multiplications (something like z + 150 * (y + 150 * x)) to compute the address.

If you use power-of-2 constants, the compiler can optimize this and use shifts instead of multiplications. Modern CPUs do multiplications quite fast, though, so the effect is insignificant.
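
As a sketch of that transformation (using 128 purely as an illustrative power of two; the function names are mine):

#include <cstddef>

// Dimension 150: the offset calculation needs genuine multiplications.
std::size_t offset_150(std::size_t x, std::size_t y, std::size_t z) {
    return (x * 150 + y) * 150 + z;
}

// Dimension 128: the compiler can emit shifts instead, because
// multiplying by 128 is the same as shifting left by 7 bits.
std::size_t offset_128(std::size_t x, std::size_t y, std::size_t z) {
    return (((x << 7) + y) << 7) + z;   // == (x * 128 + y) * 128 + z
}

The compiler performs this substitution itself whenever the dimension is a compile-time power of two; writing the shifts by hand is rarely necessary.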

Using a big table can cause a lot of cache misses, so a smaller table is probably faster than a bigger one, even if the bigger one has power-of-2 sized dimensions and the smaller one does not.



Answer 4:

Powers of two are used a lot in software because binary is the number base that computers use.

For example, OSes allocate memory in block sizes that are powers of two, processor cache sizes are powers of two, address sizes are powers of two, and so on.

Operations on power-of-two values can also be optimised: a multiply or divide becomes a simple bit shift.
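
For instance (a generic sketch, unrelated to the chunk sizes in the question), division and modulo by a power of two reduce to a shift and a mask for unsigned integers:

#include <cassert>
#include <cstdint>

int main() {
    std::uint32_t x = 1234567u;
    // For unsigned integers, dividing by 64 is a right shift by 6 bits,
    // and the remainder modulo 64 is a bitwise AND with 64 - 1 = 63.
    assert(x / 64u == (x >> 6));
    assert(x % 64u == (x & 63u));
}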

Basically, ensuring everything uses powers of two might improve the performance of your software, but normally the compiler and/or the OS will make sure your data is used effectively even when you use arbitrary sizes.



Answer 5:

It may be faster, it may be slower, or it may be the same speed. It would be very hard to give the correct answer just by looking at the code. So the answer is: measure it, change the code, and measure it again. If your code has to run on different computers, measure it on each of them.

I'd tend to assume that power-of-two alignment is often asking for severe trouble, and that using more memory than needed isn't going to help with performance. Doing lots of operations with a small part of memory that fits into some cache, then switching to the next part of memory, will often help. Accessing consecutive memory addresses will often help. Rounding up so that you can use vector operations will often help.