Implementations might differ between the actual sizes of types, but on most, types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain amount of memory no matter its value? For example, if I created the following integer with the value of 255
int myInt = 255;
Then myInt
would occupy 4 bytes with my compiler. However, the actual value, 255
can be represented with only 1 byte, so why would myInt
not just occupy 1 byte of memory? Or the more generalized way of asking: Why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?
There are a few reasons. One is the added complexity for handling arbitrary-sized numbers and the performance hit this gives because the compiler can no longer optimize based on the assumption that every int is exactly X bytes long.
A second one is that storing simple types this way means they need an additional byte to hold the length. So, a value of 255 or less actually needs two bytes in this new system, not one, and in the worst case you now need 5 bytes instead of 4. This means that the performance win in terms of memory used is less than you might think and in some edge cases might actually be a net loss.
A third reason is that computer memory is generally addressable in words, not bytes. (But see footnote). Words are a multiple of bytes, usually 4 on 32-bit systems and 8 on 64 bit systems. You usually can't read an individual byte, you read a word and extract the nth byte from that word. This means both that extracting individual bytes from a word takes a bit more effort than just reading the entire word and that it is very efficient if the entire memory is evenly divided in word-sized (ie, 4-byte sized) chunks. Because, if you have arbitrary sized integers floating around, you might end up with one part of the integer being in one word, and another in the next word, necessitating two reads to get the full integer.
Footnote: To be more precise, while you addressed in bytes, most systems ignored the 'uneven' bytes. Ie, address 0, 1, 2 and 3 all read the same word, 4, 5, 6 and 7 read the next word, and so on.
On an unreleated note, this is also why 32-bit systems had a max of 4 GB memory. The registers used to address locations in memory are usually large enough to hold a word, ie 4 bytes, which has a max value of (2^32)-1 = 4294967295. 4294967296 bytes is 4 GB.
It is an optimization and simplification.
You can either have fixed sized objects. Thus storing the value.
Or you can have variable sized objets. But storing value and size.
fixed sized objects
The code that manipulates number does not need to worry about size. You assume that you always use 4 bytes and make the code very simple.
Dynamic sized objects
The code the manipulates number must understand when reading a variable that it must read the value and size. Use the size to make sure all the high bits are zero out in the register.
When place the value back in memory if the value has not exceeded its current size then simply place the value back in memory. But if the value has shrunk or grown you need to move the storage location of the object to another location in memory to make sure it does not overflow. Now you have to track the position of that number (as it can move if it grows too large for its size). You also need to track all the unused variable locations so they can potentially be reused.
Summary
The code generated for fixed size objects is a lot simpler.
Note
Compression uses the fact that 255 will fit into one byte. There are compression schemes for storing large data sets that will actively use different size values for different numbers. But since this is not live data you don't have the complexities described above. You use less space to store the data at a cost of compressing/de-compressing the data for storage.
This is known as variable-length encoding, there are various encodings defined, for example VLQ. One of the most famous, however, is probably UTF-8: UTF-8 encodes code points on a variable number of bytes, from 1 to 4.
As always in engineering, it's all about trade-offs. There is no solution which has only advantages, so you have to balance advantages and trade-offs when designing your solution.
The design which was settled on was to use fixed-size fundamental types, and the hardware/languages just flew down from there.
So, what is the fundamental weakness of variable encoding, which caused it to be rejected in favor of more memory hungry schemes? No Random Addressing.
What is the index of the byte at which the 4th code point starts in a UTF-8 string?
It depends on the values of the previous code points, a linear scan is required.
Surely there are variable-length encoding schemes which are better at random-addressing?
Yes, but they are also more complicated. If there's an ideal one, I've never seen it yet.
Does Random Addressing really matters anyway?
Oh YES!
The thing is, any kind of aggregate/array relies on fixed-size types:
struct
? Random Addressing!Which means you essentially have the following trade-off:
Fixed size types OR Linear memory scans
It can be less. Consider the function:
it compiles to assembly code (g++, x64, details stripped)
Here,
bar
andbaz
end up using zero bytes to represent.Because you told it to use that much. When using an
unsigned int
, some standards dictate that 4 bytes will be used and that the available range for it will be from 0 to 4,294,967,295. If you were to use anunsigned char
instead, you would probably only be using the 1 byte that you're looking for, (depending on the standard and C++ normally uses these standards).If it weren't for these standards you'd have to keep this in mind: how is the compiler or CPU supposed to know to only use 1 byte instead of 4? Later on in your program you might add or multiply that value, which would require more space. Whenever you make a memory allocation, the OS has to find, map, and give you that space, (potentially swapping memory to virtual RAM as well); this can take a long time. If you allocate the memory before hand, you won't have to wait for another allocation to be completed.
As for the reason why we use 8 bits per byte, you can take a look at this: What is the history of why bytes are eight bits?
On a side note, you could allow the integer to overflow; but should you use a signed integer, the C\C++ standards state that integer overflows result in undefined behavior. Integer overflow
To change the size of a variable would require reallocation and this is usually not worth the additional CPU cycles compared to wasting a few more bytes of memory.
Local variables go on a stack which is very fast to manipulate when those variables do not change in size. If you decided you want to expand the size of a variable from 1 byte to 2 bytes then you have to move everything on the stack by one byte to make that space for it. That can potentially cost a lot of CPU cycles depending on how many things need to be moved.
Another way you could do it is by making every variable a pointer to a heap location, but you would waste even more CPU cycles and memory this way, actually. Pointers are 4 bytes (32 bit addressing) or 8 bytes (64 bit addressing), so you are already using 4 or 8 for the pointer, then the actual size of the data on the heap. There is still a cost to reallocation in this case. If you need to reallocate heap data, you could get lucky and have room to expand it inline, but sometimes you have to move it somewhere else on the heap to have the contiguous block of memory of the size you want.
It's always faster to decide how much memory to use beforehand. If you can avoid dynamic sizing you gain performance. Wasting memory is usually worth the performance gain. That's why computers have tons of memory. :)