On 32-bit CPUs, is an 'integer' type more

2019-01-07 21:32发布

问题:

On a 32-bit CPU, an integer is 4 bytes and a short integer is 2 bytes. If I am writing a C/C++ application that uses many numeric values that will always fit within the provided range of a short integer, is it more efficient to use 4 byte integers or 2 byte integers?

I have heard it suggested that 4 byte integers are more efficient as this fits the bandwidth of the bus from memory to the CPU. However, if I am adding together two short integers, would the CPU package both values in a single pass in parallel (thus spanning the 4 byte bandwidth of the bus)?

回答1:

Yes, you should definitely use a 32 bit integer on a 32 bit CPU, otherwise it may end up masking off the unused bits (i.e., it will always do the maths in 32 bits, then convert the answer to 16 bits)

It won't do two 16 bit operations at once for you, but if you write the code yourself and you're sure it won't overflow, you can do it yourself.

Edit: I should add that it also depends somewhat on your definition of "efficient". While it will be able to do 32-bit operations more quickly, you will of course use twice as much memory.

If these are being used for intermediate calculations in an inner loop somewhere, then use 32-bit. If, however, you're reading this from disk, or even if you just have to pay for a cache miss, it may still work out better to use 16-bit integers. As with all optimizations, there's only one way to know: profile it.



回答2:

If you have a large array of numbers, then go with the smallest size that works. It will be more efficient to work with an array of 16 bit shorts than 32 bit ints since you get twice the cache density. The cost of any sign extension the CPU has to do to work with 16 bit values in 32 bit registers is trivially negligible compared to the cost of a cache miss.

If you are simply using member variables in classes mixed with other data types then it is less clear cut as the padding requirements will likely remove any space saving benefit of the 16 bit values.



回答3:

If you're using "many" integer values, the bottleneck in your processing is liable to be bandwidth to memory. 16 bit integers pack more tightly into the data cache, and would therefore be a performance win.

If you are number crunching on a very large amount of data, you should read What Every Programmer Should Know About Memory by Ulrich Drepper. Concentrate on chapter 6, about maximizing the efficiency of the data cache.



回答4:

A 32 bit CPU is a CPU that usually operates on 32 bit values internally, yet that does not mean that it is any slower when performing the same operation on a 8/16 bit value. x86 for example, still backward compatible up to the 8086, can operate on fractions of a register. That means even if a register is 32 bit wide, it can operate only on the first 16 or the first 8 bit of that register and there will be no slow down at all. This concept has even been adopted by x86_64, where the registers are 64 bit, yet they still can operate only on the first 32, 16, or 8 bit.

Also x86 CPUs always load a whole cache line from memory, if not already in cache, and a cache line is bigger than 4 byte anyway (for 32 bit CPUs rather 8 or 16 bytes) and thus loading 2 byte from memory is equally fast as loading 4 byte from memory. If processing many values from memory, 16 bit values may actually be much faster than 32 bit values, since there are less memory transfers. If a cache line is 8 byte, there are four 16 bit values per cache line, yet only two 32 bit values, thus when using 16 bit ints you have one memory access every four values, using 32 bit ints you have one every two values, resulting in twice as many transfers for processing a large int array.

Other CPUs, like PPC for example, cannot process only a fraction of a register, they always process the full register. Yet these CPUs usually have special load operations that allow them to, e.g. load a 16 bit value from memory, expand it to 32 bit and write it to a register. Later on they have a special store operation that takes the value from the register and only stores the last 16 bit back to memory; both operation need only one CPU cycle, just like a 32 bit load/store would need, so there is no speed difference either. And since PPC can only perform arithmetic operations on registers (unlike x86, which can also operate on memory directly), this load/store procedure takes place anyway whether you use 32 bit ints or 16 bit ints.

The only disadvantage, if you chain multiple operations on a 32 bit CPU that can only operate on full registers, is that the 32 bit result of the last operation may have to be "cut back" to 16 bit before the next operation is performed, otherwise the result may not be correct. Such a cut back is only a single CPU cycle, though (a simple AND operation), and compilers are very good at figuring out when such a cut back is really necessary and when leaving it out won't have any influence on the final result, so such a cut back is not performed after every instruction, it is only performed if really unavoidable. Some CPUs offers various "enhanced" instructions which make such a cut back unnecessary and I've seen plenty of code in my life, where I had expected such a cut back, yet looking at the generated assembly code, the compiler found a way to avoid it entirely.

So if you expect a general rule here, I'll have to disappoint you. Neither can one say for sure that 16 bit operations are equally fast to 32 bit operations, nor can anyone say for sure that 32 bit operations will always be faster. It depends also what exactly your code is doing with those numbers and how it is doing that. I've seen benchmarks where 32 bit operations were faster on certain 32 bit CPUs than the same code with 16 bit operations, however I also already saw the opposite being true. Even switching from one compiler to another one or upgrading your compiler version may already turn everything around again. I can only say the following: Whoever claims that working with shorts is significantly slower than working with ints, shall please provide a sample source code for that claim and name CPU and compiler he used for testing, since I have never experienced anything like that within about the past 10 years. There may be some situations, where working with ints is maybe 1-5% faster, yet anything below 10% is not "significant" and the question is, is it worth to waste twice the memory in some cases only because it may buy you 2% performance? I don't think so.



回答5:

It depends. If you are CPU bound, 32 bit operations on a 32 bit CPU will be faster than 16 bit. If you are memory bound (specifically if you have too many L2 cache misses), then use the smallest data you can squeeze into.

You can find out which one you are using a profiler that will measure both CPU and L2 misses like Intel's VTune. You will run your app 2 times with the same load, and it will merge the 2 runs into one view of the hotspots in your app, and you can see for each line of code how many cycles were spent on that line. If at an expensive line of code, you see 0 cache misses, you are CPU bound. If you see tons of misses, you are memory bound.



回答6:

Don't listen to the advice, try it.

This is probably going to depend heavily on the hardware/compiler that you're using. A quick test should make short work of this question. Probably less time to write the test than it is to write the question here.



回答7:

If you are operating on a large dataset, the biggest concern is memory footprint. A good model in this case is to assume that the CPU is infinitely fast, and spend your time worrying about how much data has to be moved to/from memory. In fact, CPUs are now so fast that it is sometimes more efficient to encode (e.g., compress) the data. That way, the CPU does (potentially much) more work (decoding/coding), but the memory bandwidth is substantially reduced.

Thus, if your dataset is large, you are probably better off using 16 bit integers. If your list is sorted, you might design a coding scheme that involves differential or run-length encoding, which will reduce memory bandwidth even more.



回答8:

When you say 32bit, I'll assume you mean x86. 16 bit arithmetic is quite slow: the operand-size prefix makes decoding really slow. So don't make your temp variables short int or int16_t.

However, x86 can efficiently load 16 and 8 bit integers into 32 or 64 bit registers. (movzx / movsx: zero and sign extension). So feel free to use short int for arrays and struct fields, but make sure you use int or long for your temp variables.

However, if I am adding together two short integers, would the CPU package both values in a single pass in parallel (thus spanning the 4 byte bandwidth of the bus)?

That is nonsense. load/store instructions interact with L1 cache, and the limiting factor is number of ops; width is irrelevant. e.g. on core2: 1 load and 1 store per cycle, regardless of width. L1 cache has a 128 or 256bit path to L2 cache.

If loads are your bottleneck, one wide load which you split up with shifts or masks after loading can help. Or use SIMD to process data in parallel without unpacking after loading in parallel.