
What memory address spaces are there?


Question:

What forms of memory address spaces have been used?

Today, a large flat virtual address space is common. Historically, more complicated address spaces have been used, such as a pair of a base address and an offset, a pair of a segment number and an offset, a word address plus some index for a byte or other sub-object, and so on.

From time to time, various answers and comments assert that C/C++ pointers are essentially integers. That is an incorrect model for C/C++, since the variety of address spaces is undoubtedly the cause of some of the C rules about pointer operations. For example, not defining pointer arithmetic beyond an array simplifies support for pointers in a base and offset model. Limits on pointer conversion simplify support for address-plus-extra-data models.
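
To make the rule concrete, here is a minimal C sketch of where the standard draws the line (standard C semantics, not specific to any machine):

    int main(void)
    {
        int a[4];
        int *p = a;

        int *end = p + 4;        /* OK: a one-past-the-end pointer may be formed */
        /* int *bad = p + 5; */  /* undefined: arithmetic past one-past-the-end */
        /* int x = *end;     */  /* undefined: one-past-the-end must not be dereferenced */

        (void)end;
        return 0;
    }

On a base-and-offset machine, the implementation only has to guarantee that offsets from the array's base up to one-past-the-end behave; anything beyond that may not even be representable.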

That recurring assertion motivates this question. I am looking for information about the variety of address spaces to illustrate that a C/C++ pointer is not necessarily a simple integer and that the C/C++ restrictions on pointer operations are sensible given the wide variety of machines to be supported.

Useful information may include:

  • Examples of computer architectures with various address spaces and descriptions of those spaces.
  • Examples of various address spaces still in use in machines currently being manufactured.
  • References to documentation or explanation, especially URLs.
  • Elaboration on how address spaces motivate C/C++ pointer rules.

This is a broad question, so I am open to suggestions on managing it. I would be happy to see collaborative editing on a single generally inclusive answer. However, that may fail to award reputation as deserved. I suggest up-voting multiple useful contributions.

Answer 1:

Just about anything you can imagine has probably been used. The first major division is between byte addressing (nearly all modern architectures) and word addressing (standard before the IBM 360 and the PDP-11, and I believe modern Unisys mainframes are still word addressed). With word addressing, char* and void* would often be bigger than int*; even when they were not bigger, the "byte selector" would sit in the high-order bits, which were either required to be 0 or ignored for anything other than bytes. (On a PDP-10, for example, if p was a char*, (int)p < (int)(p+1) would often be false, even though int and char* had the same size.)
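
A loose model of that effect (illustrative only; not the exact PDP-10 byte-pointer encoding, in which a position-and-size field occupies the high-order bits and the position counts down as the pointer advances):

    #include <stdint.h>
    #include <stdio.h>

    #define POS_SHIFT 18   /* word address kept in the low 18 bits (illustrative) */

    /* Pack a word address and a byte position into one "pointer" word. */
    uint32_t byte_ptr(uint32_t word, uint32_t pos) {
        return (pos << POS_SHIFT) | word;
    }

    /* Advance one byte: the position field counts down, and the word
       address only changes when the position wraps (4 bytes per word). */
    uint32_t byte_ptr_inc(uint32_t p) {
        uint32_t pos  = p >> POS_SHIFT;
        uint32_t word = p & ((1u << POS_SHIFT) - 1);
        return pos > 0 ? byte_ptr(word, pos - 1) : byte_ptr(word + 1, 3);
    }

    int main(void) {
        uint32_t p = byte_ptr(0100, 3);       /* first byte of word 0100 */
        printf("%d\n", p < byte_ptr_inc(p));  /* prints 0: "(int)p < (int)(p+1)" fails */
        return 0;
    }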

Among byte-addressed machines, the major variants are segmented and non-segmented architectures. Both are still widespread today, although in the case of 32-bit Intel x86 (a segmented architecture with 48-bit logical addresses), some of the more widely used OSs (Windows and Linux) artificially restrict user processes to a single segment, simulating flat addressing.
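
A concrete (and simpler) relative of that scheme is 16-bit x86 real mode, where the hardware forms a physical address as segment * 16 + offset, so distinct segment:offset pairs alias the same byte and comparing raw pointer values tells you little:

    #include <stdint.h>
    #include <stdio.h>

    /* 8086 real-mode address formation: physical = segment * 16 + offset. */
    uint32_t phys(uint16_t seg, uint16_t off) {
        return ((uint32_t)seg << 4) + off;
    }

    int main(void) {
        /* Two different segment:offset pairs, one physical byte. */
        printf("%#x\n", (unsigned)phys(0x1234, 0x0010));  /* 0x12350 */
        printf("%#x\n", (unsigned)phys(0x1235, 0x0000));  /* 0x12350 */
        return 0;
    }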

Although I've no recent experience, I would expect even more variety in embedded processors. In particular, it was common in the past for embedded processors to use a Harvard architecture, where code and data sit in independent address spaces (so that a function pointer and a data pointer, cast to a large enough integral type, could compare equal even though they referred to entirely different locations).



Answer 2:

I would say you are asking the wrong question, except as a historical curiosity.

Even if your system happens to use a flat address space -- indeed, even if every system from now until the end of time uses a flat address space -- you still cannot treat pointers as integers.

The C and C++ standards leave all sorts of pointer arithmetic "undefined". That can impact you right now, on any system, because compilers will assume you avoid undefined behavior and optimize accordingly.

For a concrete example, three months ago a very interesting bug turned up in Valgrind:

https://sourceforge.net/p/valgrind/mailman/message/29730736/

(Click "View entire thread", then search for "undefined behavior".)

Basically, Valgrind was using less-than and greater-than on pointers to try to determine whether an automatic variable was within a certain range. Because relational comparisons between pointers into different aggregates are undefined, Clang simply optimized away all of the comparisons to return a constant true (or false; I forget).
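
A minimal sketch of the pattern (not Valgrind's actual code): the range check relies on relational comparisons that the standard leaves undefined for pointers into different objects, so an optimizer is entitled to fold the whole test to a constant:

    #include <stddef.h>
    #include <stdio.h>

    /* Is p inside buf[0..n)?  The >= / < comparisons are undefined when
       p does not point into (or one past the end of) buf, so a compiler
       may legally replace this function's body with a constant. */
    int in_range(const char *p, const char *buf, size_t n) {
        return p >= buf && p < buf + n;
    }

    int main(void) {
        char buf[16];
        int unrelated;
        printf("%d\n", in_range((const char *)&unrelated, buf, sizeof buf));
        return 0;
    }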

This bug itself spawned an interesting StackOverflow question.

So while the original pointer arithmetic definitions may have catered to real machines, and that might be interesting for its own sake, it is actually irrelevant to programming today. What is relevant today is that you simply cannot assume that pointers behave like integers, period, regardless of the system you happen to be using. "Undefined behavior" does not mean "something funny happens"; it means the compiler can assume you do not engage in it. When you do, you introduce a contradiction into the compiler's reasoning; and from a contradiction, anything follows... It only depends on how smart your compiler is.

And they get smarter all the time.



Answer 3:

There are various forms of bank-switched memory.

I worked on an embedded system that had 128 KB of total memory: 64 KB of RAM and 64 KB of EPROM. Pointers were only 16 bits wide, so a pointer into the RAM could have the same value as a pointer into the EPROM, even though they referred to different memory locations.

The compiler kept track of the type of the pointer so that it could generate the instruction(s) to select the correct bank before dereferencing a pointer.

You could argue that this was like segment + offset, and at the hardware level, it essentially was. But the segment (or more correctly, the bank) was implicit from the pointer's type and not stored as the value of a pointer. If you inspected a pointer in the debugger, you'd just see a 16-bit value. To know whether it was an offset into the RAM or the ROM, you had to know the type.

For example, a Foo * could only point into RAM and a const Bar * could only point into ROM. If you had to copy a Bar into RAM, the copy would actually have a different type. (It wasn't as simple as const/non-const: everything in ROM was const, but not every const lived in ROM.)

This was all in C, and I know we used non-standard extensions to make this work. I suspect a 100% compliant C compiler probably couldn't cope with this.
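
To make the scheme concrete without leaning on any particular vendor extension (real toolchains use things like the named address spaces of ISO/IEC TR 18037, the "Embedded C" technical report), here is a plain-C simulation of the idea; all names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Two 64 KB banks and 16-bit "pointers". The bank is selected by how
       the pointer is used (its simulated type), never by its value. */
    enum bank { BANK_RAM, BANK_ROM };

    static uint8_t ram[65536];
    static uint8_t rom[65536];   /* stands in for the EPROM */

    uint8_t deref(enum bank b, uint16_t off) {
        /* A real compiler would emit a bank-select instruction here. */
        return b == BANK_RAM ? ram[off] : rom[off];
    }

    int main(void) {
        rom[0x1234] = 0x42;   /* pretend this was burned into the EPROM */
        ram[0x1234] = 0x99;

        uint16_t p = 0x1234;  /* one 16-bit pointer value... */
        printf("%#x %#x\n",   /* ...naming two different memory locations */
               (unsigned)deref(BANK_RAM, p), (unsigned)deref(BANK_ROM, p));
        return 0;
    }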



Answer 4:

From a C programmer's perspective, there are three main kinds of implementation to worry about:

  1. Those which target machines with a linear memory model, and which are designed and/or configured to be usable as a "high-level assembler"--something the authors of the Standard have expressly said they did not wish to preclude. Most implementations behave in this way when optimizations are disabled.

  2. Those which are usable as "high-level assemblers" for machines with unusual memory architectures.

  3. Those whose design and/or configuration make them suitable only for tasks that do not involve low-level programming, including clang and gcc when optimizations are enabled.

Memory-management code targeting the first type of implementation will often be compatible with all implementations of that type whose targets use the same representations for pointers and integers. Memory-management code for the second type of implementation will often need to be specifically tailored for the particular hardware architecture. Platforms that don't use linear addressing are sufficiently rare, and sufficiently varied, that unless one needs to write or maintain code for some particular piece of unusual hardware (e.g. because it drives an expensive piece of industrial equipment for which more modern controllers aren't available) knowledge of any particular architecture isn't likely to be of much use.
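
As a sketch of the kind of memory-management idiom meant here (portable across implementations of the first type that share pointer and integer representations, but not guaranteed by the Standard itself beyond the round-trip):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        char buf[32];

        /* C guarantees only that converting a valid pointer to uintptr_t
           and back yields a pointer equal to the original (C11 7.20.1.4).
           Doing arithmetic on the integer and expecting to land on buf[8]
           additionally assumes a flat, linear address space. */
        uintptr_t base = (uintptr_t)(void *)buf;
        char *p = (char *)(base + 8);   /* flat-address-space assumption */
        *p = 'x';

        printf("%c\n", buf[8]);
        return 0;
    }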

Implementations of the third type should be used only for programs that don't need to perform any memory-management or systems-programming tasks. Because the Standard doesn't require that all implementations be capable of supporting such tasks, some compiler writers--even when targeting linear-address machines--make no attempt to support any of the useful semantics thereof. Even principles like "an equality comparison between two valid pointers will--at worst--yield 0 or 1, chosen in possibly-unspecified fashion" don't apply to such implementations.