Address canonical form and pointer arithmetic

On AMD64 compliant architectures, addresses need to be in canonical form before being dereferenced.

From the Intel manual, section 3.3.7.1:

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.

Now, the most significant implemented bit on current operating systems and architectures is the 47th bit. This leaves us with a 48-bit address space.

Especially when ASLR is enabled, user programs can expect to receive an address with the 47th bit set.

If optimizations such as pointer tagging are used and the upper bits are used to store information, the program must make sure the 48th to 63rd bits are set back to whatever the 47th bit was before dereferencing the address.
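As a rough illustration of that restore step, here is a minimal sketch assuming 48 implemented address bits. The names `kAddressBits`, `canonicalize` and `tag_address` are hypothetical, and addresses are handled as plain 64-bit integers so nothing is ever dereferenced.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 48 implemented virtual-address bits (bits 0..47), as described above.
constexpr unsigned kAddressBits = 48;
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << kAddressBits) - 1;
constexpr std::uint64_t kSignBit = std::uint64_t{1} << (kAddressBits - 1);  // bit 47

// Hypothetical helper: drop whatever is stored in bits 48..63 and copy
// bit 47 into them, which restores canonical form.
constexpr std::uint64_t canonicalize(std::uint64_t tagged) {
    const std::uint64_t low = tagged & kAddressMask;  // keep bits 0..47 only
    // Sign-extend from bit 47 using unsigned wrap-around arithmetic:
    // if bit 47 is set the subtraction fills bits 48..63 with ones,
    // otherwise the value is returned unchanged.
    return (low ^ kSignBit) - kSignBit;
}

// Hypothetical helper: store a 16-bit tag in bits 48..63 of a canonical address.
constexpr std::uint64_t tag_address(std::uint64_t addr, std::uint16_t tag) {
    assert(canonicalize(addr) == addr);  // only canonical input can be restored later
    return (addr & kAddressMask) | (std::uint64_t{tag} << kAddressBits);
}

// Round trips for one lower-half and one upper-half address.
static_assert(canonicalize(tag_address(0x00007fffffffd850ULL, 0xABCD)) == 0x00007fffffffd850ULL,
              "lower-half address survives tagging");
static_assert(canonicalize(tag_address(0xffff800000000000ULL, 0x1234)) == 0xffff800000000000ULL,
              "upper-half address survives tagging");
```

In real code the integer would come from `reinterpret_cast<std::uintptr_t>(ptr)` and be cast back after `canonicalize`. That part is left out here because it depends on the platform.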

But consider this code:

int main()
{
    int* intArray = new int[100];

    int* it = intArray;

    // Fill the array with any value.
    for (int i = 0; i < 100; i++)
    {
        *it = 20;
        it++;   
    }

    delete [] intArray;
    return 0;
}

Now consider that intArray is, say:

0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1100

After setting it to intArray and increasing it once, and considering sizeof(int) == 4, it will become:

0000 0000 0000 0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

The 47th bit is the leading 1 of the fifth group from the left. The second pointer, obtained through pointer arithmetic, is invalid because it is not in canonical form. The correct address should be:

1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
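To make the arithmetic concrete, the three bit patterns above can be checked with a small sketch. It assumes 48 implemented bits, and `is_canonical` is a hypothetical helper, not an OS or compiler facility.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if bits 63 down to (implemented_bits - 1)
// are all zeros or all ones. Assumes 2 <= implemented_bits <= 64.
constexpr bool is_canonical(std::uint64_t addr, unsigned implemented_bits = 48) {
    const unsigned upper_count = 64 - implemented_bits + 1;  // 17 bits when 48 are implemented
    const std::uint64_t upper = addr >> (implemented_bits - 1);
    const std::uint64_t all_ones =
        upper_count == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << upper_count) - 1;
    return upper == 0 || upper == all_ones;
}

// The addresses from the example, written in hexadecimal.
constexpr std::uint64_t kBefore = 0x00007FFFFFFFFFFCULL;  // intArray
constexpr std::uint64_t kAfter  = kBefore + 4;            // after it++, with sizeof(int) == 4
constexpr std::uint64_t kFixed  = 0xFFFF800000000000ULL;  // the "correct" address

static_assert(kAfter == 0x0000800000000000ULL, "the increment carries into bit 47");
static_assert(is_canonical(kBefore), "the starting address is canonical");
static_assert(!is_canonical(kAfter), "the incremented address is not canonical");
static_assert(is_canonical(kFixed), "the sign-extended form is canonical");
```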

How do programs deal with this? Is there a guarantee by the OS that you will never be allocated memory whose address range varies in the 47th bit?

The canonical address rules mean there is a giant hole in the 64-bit virtual address space. 2^47-1 is not contiguous with the next valid address above it, so a single mmap won't include any of the unusable range of 64-bit addresses.

+----------+
| 2^64-1   |   0xffffffffffffffff
| ...      |
| 2^64-2^47|   0xffff800000000000
+----------+
|          |
| unusable |
|          |
+----------+
| 2^47-1   |   0x00007fffffffffff
| ...      |
| 0        |   0x0000000000000000
+----------+
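The boundaries in this diagram follow directly from the number of implemented bits. A minimal sketch, with `CanonicalRanges` and `canonical_ranges` as hypothetical names, computes them for any width:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type holding the two edges of the unusable hole.
struct CanonicalRanges {
    std::uint64_t lower_last;   // highest address of the lower half
    std::uint64_t upper_first;  // lowest address of the upper half
};

// Hypothetical helper: compute both edges for a given number of implemented bits.
constexpr CanonicalRanges canonical_ranges(unsigned implemented_bits) {
    assert(implemented_bits >= 2 && implemented_bits <= 63);
    const std::uint64_t half = std::uint64_t{1} << (implemented_bits - 1);  // 2^47 for 48 bits
    // 0 - half wraps around to 2^64 - half in unsigned arithmetic.
    return CanonicalRanges{half - 1, std::uint64_t{0} - half};
}

// With 48 implemented bits the values match the diagram.
static_assert(canonical_ranges(48).lower_last == 0x00007fffffffffffULL, "2^47-1");
static_assert(canonical_ranges(48).upper_first == 0xffff800000000000ULL, "2^64-2^47");
```

Passing a larger width, for example 57, moves both edges outward and shrinks the hole, which is the kind of extension the canonical-address rules leave room for.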

In other words:

Is there a guarantee by the OS that you will never be allocated memory whose address range varies in the 47th bit?

Yes. The 48-bit address space supported by current hardware is an implementation detail. The canonical-address rules ensure that future systems can support more virtual address bits without breaking backwards compatibility to any significant degree. You'd just need a compat flag to have the OS not give the process any memory regions with high bits not all the same. Future hardware won't need to support any kind of flag to ignore high address bits or not, because junk in the high bits is currently an error.

Fun fact: Linux defaults to mapping the stack at the top of the lower range of valid addresses.

e.g.

$ gdb /bin/ls
...
(gdb) b _start
Function "_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_start) pending.
(gdb) r
Starting program: /bin/ls

Breakpoint 1, 0x00007ffff7dd9cd0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) p $rsp
$1 = (void *) 0x7fffffffd850
(gdb) exit

$ calc
2^47-1
              0x7fffffffffff