Why do types always have a certain size no matter their value?

Posted 2020-05-14 21:13

The actual sizes of types can differ between implementations, but on most of them types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain amount of memory no matter its value? For example, suppose I created the following integer with the value 255:

int myInt = 255;

Then myInt would occupy 4 bytes with my compiler. However, the actual value, 255, can be represented with only 1 byte, so why would myInt not occupy just 1 byte of memory? Or, to ask it more generally: why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?
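To make that concrete, here's a small check (assuming a typical implementation where int is 4 bytes):

#include <iostream>

int main() {
    int small = 255;          // would fit in a single byte
    int large = 2000000000;   // needs the full width
    // sizeof reports the type's size, not the value's: both print 4 here.
    std::cout << sizeof(small) << ' ' << sizeof(large) << '\n';
}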

Tags: c++
19 answers
戒情不戒烟
#2 · 2020-05-14 21:54

Because it would be very complicated and computation-heavy to have simple types with dynamic sizes. I'm not sure it would even be possible.
The computer would have to check how many bits the number takes after every change of its value. That would add quite a lot of extra operations. And it would be much harder to perform calculations when you don't know the sizes of variables at compile time.

To support dynamic sizes of variables, the computer would actually have to remember how many bytes a variable has right now, which... would require additional memory to store that information. And this information would have to be examined before every operation on the variable to choose the right processor instruction.
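As a rough sketch (the struct and its layout are invented purely for illustration), a "dynamic int" would have to carry its own width and branch on it for every access:

#include <cstdint>

// Hypothetical "dynamic int": not how C++ works, just what the overhead would look like.
struct DynInt {
    std::uint8_t size;      // extra byte spent just remembering the current width (1, 2, or 4)
    std::uint8_t bytes[4];  // payload; only the first `size` bytes are meaningful
};

std::uint32_t load(const DynInt& v) {
    std::uint32_t out = 0;
    for (std::uint8_t i = 0; i < v.size; ++i)
        out |= std::uint32_t(v.bytes[i]) << (8 * i);  // a loop instead of a single fixed-width load
    return out;
}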

To better understand how a computer works and why variables have fixed sizes, learn the basics of assembly language.

That said, I suppose it would be possible to achieve something like that with constexpr values. However, it would make the code less predictable for the programmer. I suppose some compiler optimizations may do something like that, but they hide it from the programmer to keep things simple.

I described here only the problems that concern the performance of a program. I omitted all the problems that would have to be solved to actually save memory by reducing the sizes of variables. Honestly, I don't think it is even possible.


In conclusion, using smaller variables than declared makes sense only if their values are known at compile time. It is quite probable that modern compilers do that. In other cases it would cause too many hard or even unsolvable problems.

疯言疯语
#3 · 2020-05-14 21:55

Because in a language like C++, a design goal is that simple operations compile down to simple machine instructions.

All mainstream CPU instruction sets work with fixed-width types, and if you want to do variable-width types, you have to do multiple machine instructions to handle them.

As for why the underlying computer hardware is that way: It's because it's simpler, and more efficient for many cases (but not all).

Imagine the computer as a piece of tape:

| xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | ...

If you simply tell the computer to look at the first byte on the tape, xx, how does it know whether the type stops there or continues on to the next byte? If you have a number like 255 (hexadecimal FF) or a number like 65535 (hexadecimal FFFF), the first byte is FF in both cases.

So how do you know? You have to add additional logic, and "overload" the meaning of at least one bit or byte value to indicate that the value continues into the next byte. That logic is never "free": either you emulate it in software or you add a bunch of additional transistors to the CPU to do it.
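For illustration, a minimal sketch of such a continuation-bit scheme (LEB128-style: 7 payload bits per byte, with the top bit meaning "more bytes follow"):

#include <cstdint>
#include <vector>

std::vector<std::uint8_t> encode_vlq(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    do {
        std::uint8_t byte = value & 0x7F;  // low 7 bits of payload
        value >>= 7;
        if (value != 0)
            byte |= 0x80;                  // flag: "value continues in the next byte"
        out.push_back(byte);
    } while (value != 0);
    return out;
}
// encode_vlq(255) needs 2 bytes, encode_vlq(100) only 1 -- the size now depends on the value,
// which is exactly what a fixed-width int avoids.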

The fixed-width types of languages like C and C++ reflect that.

It doesn't have to be this way, and more abstract languages which are less concerned with mapping to maximally efficient code are free to use variable-width encodings (also known as "Variable Length Quantities" or VLQ) for numeric types.

Further reading: If you search for "variable length quantity" you can find some examples of where that kind of encoding is actually efficient and worth the additional logic. It's usually when you need to store a huge amount of values which might be anywhere within a large range, but most values tend towards some small sub-range.


Note that if a compiler can prove that it can get away with storing a value in a smaller amount of space without breaking any code (for example, a variable only visible internally within a single translation unit), and its optimization heuristics suggest that doing so will be more efficient on the target hardware, it is entirely allowed to optimize accordingly and store the value in less space, so long as the rest of the code works "as if" it did the standard thing.

But when the code has to interoperate with other code that might be compiled separately, sizes have to stay consistent, or every piece of code has to follow the same convention.

Because if it's not consistent, there's this complication: What if I have int x = 255; but then later in the code I do x = y? If int could be variable-width, the compiler would have to know ahead of time to pre-allocate the maximum amount of space it'll need. That's not always possible, because what if y is an argument passed in from another piece of code that's compiled separately?

可以哭但决不认输i
#4 · 2020-05-14 21:56

Java uses classes called BigInteger and BigDecimal to do exactly this, and C++'s GMP library apparently offers a similar class interface (thanks Digital Trauma). You can easily do it yourself in pretty much any language if you want.
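For example, a minimal sketch using GMP's C++ class interface (mpz_class from <gmpxx.h>; link with -lgmpxx -lgmp):

#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class n = 1;
    for (int i = 0; i < 100; ++i)
        n *= 1000;  // the value (10^300) grows far beyond any fixed-width integer
    std::cout << n.get_str().size() << " decimal digits\n";  // prints 301
}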

CPUs have always had the ability to use BCD (binary-coded decimal), which is designed to support operations of any length (but you tend to manually operate on one byte at a time, which would be SLOW by today's GPU standards).

The reason we don't use these or other similar solutions? Performance. Your most highly performant languages can't afford the expense of expanding a variable in the middle of some tight loop operation; it would be very non-deterministic.

In mass-storage and transport situations, packed values are often the ONLY kind of value you would use. For example, a music/video packet being streamed to your computer might spend a bit to specify whether the next value is 2 bytes or 4 bytes, as a size optimization.
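As a purely hypothetical illustration of that kind of framing (the layout below is made up; real codecs define their own):

#include <cstddef>
#include <cstdint>

// Hypothetical framing: top bit of the first byte set means a 4-byte value, clear means 2 bytes.
std::uint32_t read_packed(const std::uint8_t* p, std::size_t& consumed) {
    if (p[0] & 0x80) {                                   // "long form"
        consumed = 4;
        return (std::uint32_t(p[0] & 0x7F) << 24) | (std::uint32_t(p[1]) << 16) |
               (std::uint32_t(p[2]) << 8) | p[3];
    }
    consumed = 2;                                        // "short form"
    return (std::uint32_t(p[0]) << 8) | p[1];
}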

Once it's on your computer where it can be used, though, memory is cheap but the speed and complication of resizable variables is not. That's really the only reason.

Explosion°爆炸
#5 · 2020-05-14 21:58

Something simple which most answers seem to miss:

because it suits the design goals of C++.

Being able to work out a type's size at compile time allows a huge number of simplifying assumptions to be made by the compiler and the programmer, which bring a lot of benefits, particularly with regards to performance. Of course, fixed-size types have concomitant pitfalls like integer overflow. This is why different languages make different design decisions. (For instance, Python integers are essentially variable-size.)
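For instance, a tiny example of the overflow pitfall, using an 8-bit unsigned type where wraparound is at least well defined:

#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t counter = 255;  // the maximum an 8-bit unsigned type can hold
    ++counter;                   // silently wraps to 0 instead of growing
    std::cout << int(counter) << '\n';  // prints 0
}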

Probably the main reason C++ leans so strongly toward fixed-size types is its goal of C compatibility. However, since C++ is a statically typed language that tries to generate very efficient code and avoids adding things not explicitly specified by the programmer, fixed-size types still make a lot of sense.

So why did C opt for fixed-size types in the first place? Simple. It was designed to write '70s-era operating systems, server software, and utilities: things which provided infrastructure (such as memory management) for other software. At such a low level, performance is critical, and so is having the compiler do precisely what you tell it to.

chillily
#6 · 2020-05-14 21:59

The compiler is supposed to produce assembler (and ultimately machine code) for some machine, and generally C++ tries to be sympathetic to that machine.

Being sympathetic to the underlying machine means roughly: making it easy to write C++ code which will map efficiently onto the operations the machine can execute quickly. So, we want to provide access to the data types and operations that are fast and "natural" on our hardware platform.

Concretely, consider a specific machine architecture. Let's take the current Intel x86 family.

The Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, section 3.4.1 says:

The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

• Operands for logical and arithmetic operations

• Operands for address calculations

• Memory pointers

So, we want the compiler to use these EAX, EBX etc. registers when it compiles simple C++ integer arithmetic. This means that when I declare an int, it should be something compatible with these registers, so that I can use them efficiently.

The registers are always the same size (here, 32 bits), so my int variables will always be 32 bits as well. I'll use the same layout (little-endian) so that I don't have to do a conversion every time I load a variable value into a register, or store a register back into a variable.

Using godbolt we can see exactly what the compiler does for some trivial code:

int square(int num) {
    return num * num;
}

compiles (with GCC 8.1 and -fomit-frame-pointer -O3 for simplicity) to:

square(int):
  imul edi, edi
  mov eax, edi
  ret

this means:

  1. the int num parameter was passed in register EDI, meaning it's exactly the size and layout Intel expect for a native register. The function doesn't have to convert anything
  2. the multiplication is a single instruction (imul), which is very fast
  3. returning the result is simply a matter of copying it to another register (the caller expects the result to be put in EAX)

Edit: we can add a relevant comparison to show the difference that using a non-native layout makes. The simplest case is storing values in something other than the native width.

Using godbolt again, we can compare a simple native multiplication

unsigned mult (unsigned x, unsigned y)
{
    return x*y;
}

mult(unsigned int, unsigned int):
  mov eax, edi
  imul eax, esi
  ret

with the equivalent code for a non-standard width

struct pair {
    unsigned x : 31;
    unsigned y : 31;
};

unsigned mult (pair p)
{
    return p.x*p.y;
}

mult(pair):
  mov eax, edi
  shr rdi, 32
  and eax, 2147483647
  and edi, 2147483647
  imul eax, edi
  ret

All the extra instructions are concerned with converting the input format (two 31-bit unsigned integers) into the format the processor can handle natively. If we wanted to store the result back into a 31-bit value, there would be another one or two instructions to do this.

This extra complexity means you'd only bother with this when the space saving is very important. In this case we're only saving two bits compared to using the native unsigned or uint32_t type, which would have generated much simpler code.


A note on dynamic sizes:

The example above is still fixed-width values rather than variable-width, but the width (and alignment) no longer match the native registers.

The x86 platform has several native sizes, including 8-bit and 16-bit in addition to the main 32-bit (I'm glossing over 64-bit mode and various other things for simplicity).

These types (char, int8_t, uint8_t, int16_t, etc.) are also directly supported by the architecture - partly for backwards compatibility with the older 8086/286/386 etc. instruction sets.

It's certainly the case that choosing the smallest natural fixed-size type that will suffice can be good practice - loads and stores are still quick single instructions, you still get full-speed native arithmetic, and you can even improve performance by reducing cache misses.
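A rough illustration of the footprint difference (the field names are invented, and exact sizes depend on the platform's padding rules, but on a typical 64-bit target these print 24 and 8):

#include <cstdint>
#include <iostream>

// Same logical fields, but the second struct picks the smallest fixed-width
// types that suffice, so an array of them touches far less cache.
struct Wide  { std::int64_t id; std::int64_t age; std::int64_t flags; };
struct Small { std::int32_t id; std::int8_t  age; std::int8_t  flags; };

int main() {
    std::cout << sizeof(Wide) << ' ' << sizeof(Small) << '\n';  // typically 24 and 8
}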

This is very different to variable-length encoding - I've worked with some of these, and they're horrible. Every load becomes a loop instead of a single instruction. Every store is also a loop. Every structure is variable-length, so you can't use arrays naturally.


A further note on efficiency

In subsequent comments, you've been using the word "efficient", as far as I can tell with respect to storage size. We do sometimes choose to minimize storage size - it can be important when we're saving very large numbers of values to files, or sending them over a network. The trade-off is that we need to load those values into registers to do anything with them, and performing the conversion isn't free.

When we discuss efficiency, we need to know what we're optimizing, and what the trade-offs are. Using non-native storage types is one way to trade processing speed for space, and sometimes makes sense. Using variable-length storage (for arithmetic types at least), trades more processing speed (and code complexity and developer time) for an often-minimal further saving of space.

The speed penalty you pay for this means it's only worthwhile when you need to absolutely minimize bandwidth or long-term storage, and for those cases it's usually easier to use a simple and natural format - and then just compress it with a general-purpose system (like zip, gzip, bzip2, xz or whatever).


tl;dr

Each platform has one architecture, but you can come up with an essentially unlimited number of different ways to represent data. It's not reasonable for any language to provide an unlimited number of built-in data types. So C++ implicitly provides access to the platform's native, natural set of data types, and lets you code any other (non-native) representation yourself.

Explosion°爆炸
#7 · 2020-05-14 21:59

Because types fundamentally represent storage, and they are defined in terms of maximum value they can hold, not the current value.
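In C++ terms, that maximum is what std::numeric_limits reports, and it is fixed for the type no matter what value is currently stored; a quick illustration:

#include <iostream>
#include <limits>

int main() {
    int myInt = 255;  // the stored value doesn't change the type's range or size
    std::cout << std::numeric_limits<int>::max() << ' ' << sizeof(myInt) << '\n';
}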

A very simple analogy would be a house: a house has a fixed size, regardless of how many people live in it, and there is also a building code which stipulates the maximum number of people who can live in a house of a certain size.

However, even if a single person is living in a house which can accommodate 10, the size of the house is not going to be affected by the current number of occupants.
