Are the stages of compilation of a C++ program specified by the standard?
If so, what are they?
If not, an answer for a widely-used compiler (I'd prefer MSVS) would be great.
I'm talking about preprocessing, tokenization, parsing and such. What is the order in which they are executed and what do they do in particular?
EDIT: I know what compilation, linking and preprocessing do; I'm mostly interested in the others and the order. Explanations for these are, of course, also welcome, since I might not be the only one interested in an answer.
The C++ specification is intentionally vague in many respects, mostly to remain implementation independent. A lot of the areas where the language is vague aren't a large concern anymore - for example, you can usually rely on a char being 8 bits. However, other issues, such as the layout of structures that use multiple inheritance, are a real concern, as are the implications of virtual functions on classes. These issues affect the compatibility of code generated with different compilers. The Application Binary Interface (or ABI) of C++ isn't defined by the standard, and as a result you occasionally have to dip into C where this becomes problematic. Writing a plugin interface is a good example.
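To make the plugin case concrete, here is a minimal sketch of what "dipping into C" tends to look like. All the names (`plugin_api`, `plugin_get_api`, `add_impl`, `name_impl`, `g_api`) are made up for illustration and don't come from any real plugin framework; the point is only that everything crossing the module boundary is plain C, so name mangling and class layout never come into play.

```cpp
#include <cstdint>

extern "C" {

// Hypothetical plugin ABI: only fixed-width integers and C function
// pointers cross the boundary, so no vtable or mangling rules are involved.
struct plugin_api {
    std::uint32_t version;          // bumped whenever the struct changes
    int (*add)(int lhs, int rhs);   // plain function pointer, no virtuals
    const char* (*name)(void);
};

// Internal-linkage implementations; declared inside extern "C" so their
// function types match the function-pointer members above.
static int add_impl(int lhs, int rhs) { return lhs + rhs; }
static const char* name_impl(void) { return "demo-plugin"; }

static const plugin_api g_api = {1u, &add_impl, &name_impl};

// The single entry point a host would look up by its unmangled name.
const plugin_api* plugin_get_api(void) { return &g_api; }

}  // extern "C"
```

A host built with a different compiler could then load the module, resolve `plugin_get_api` by name, and call through the function pointers without needing to agree on any C++ ABI details.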
Similarly, the standard doesn't give a detailed description of how a compiler should be built, because there are many key decisions and features that differentiate compilers. For example, MSVC can perform partial builds (allowing edit and continue), which GCC doesn't support. Generally speaking though, all compilers perform similar stages: preprocessing, syntax parsing, determining program flow, producing a symbol table, and producing a linear series of instructions (an object file). Those object files are then linked to produce an executable, which is usually done by a separate linker.
I had a brief look, and it's rather hard to find descriptions of individual compilers. I doubt there's much out there on commercial compilers like Microsoft's offering, purely for commercial reasons. GCC is your best bet, although Microsoft is happy to describe the process. This is pretty banal stuff though: compilers all work pretty much the same way. The real gold is in how they execute these stages, the algorithms and data structures they use. In that respect, I recommend this book. I bought a brand new copy for a university course a few years back, and I borrowed most of my textbooks from the library :).
The 9 so-called "phases of translation" are listed in the standard in [lex.phases] (2.2 in C++11, 2.1 in C++03).
The detail demanded in the standard varies: preprocessing is split up into several phases, because it's important at various points in the standard exactly what has "already been done" and what is "left to do" when a particular bit of behavior is defined. So although it doesn't tell you how to write a lexer, it gives you a pretty clear roadmap.
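A small sketch of why the ordering of the early phases is observable. The identifiers (`spliced`, `answer`, `phase_demo`) are invented for the example; the behavior relies only on line splicing (phase 2) happening before tokenization and comment replacement (phase 3).

```cpp
// Phase 2 deletes each backslash-newline pair before phase 3 forms tokens,
// so the two physical lines below are read as the single identifier `spliced`.
const int spl\
iced = 42;

// Phase 3 replaces each comment with one space, so `int/**/answer` is
// tokenized as `int answer` rather than being pasted into `intanswer`.
const int/**/answer = spliced;

// Hypothetical helper so the result can be inspected from other code.
int phase_demo() { return answer; }
```

If splicing happened after tokenization, the first declaration would not compile, which is the kind of "already been done" dependency the standard is pinning down.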
Linking, on the other hand, is mostly left to the implementation, because the standard doesn't care how a given name is looked up, just what it refers to.
It doesn't give any detail on parsing either; it just says "The resulting tokens are syntactically and semantically analyzed and translated". That's because the whole of chapters 3-15 is required to fill in that detail.
It doesn't mention internal representations during parsing/translation at all, and neither does it mention optimization phases -- they're important to the design of compilers, but they're not important to the standard. Optimization can occur in different places in different compilers. For a long time, optimization was almost entirely in the compilation phase, before emitting object files, and linkers were dumb as a post. I think now serious C++ implementations can all do at least some optimization across multiple TUs. So "the others" aren't just left out of the standard, they do actually change over time.
Yes and no.
The C++ standard defines 9 "phases of translation". Quoting from the N3242 draft (10MB PDF), dated 2011-02-28 (prior to the release of the official C++11 standard), section 2.2:
As indicated by the [SNIP] markers, I haven't quoted the entire section, just enough to get the idea across.
To emphasize, compilers are not required to follow this exact model, as long as the final result is as if they did.
Phases 1-6 correspond more or less to the preprocessor, 7 to what you might normally think of as compilation, 8 deals with templates, and 9 corresponds to linking.
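As a rough illustration of that mapping (the names `GREETING`, `kMessage`, `twice` and `doubled` are made up for the example), the snippet below touches phase 4 (macro expansion), phase 6 (concatenation of adjacent string literals) and phase 8 (template instantiation):

```cpp
#define GREETING "hello, "

// Phase 4 expands GREETING to a string literal; phase 6 then concatenates
// the two adjacent literals into one before any parsing takes place.
constexpr char kMessage[] = GREETING "world";
static_assert(sizeof(kMessage) == 13, "7 + 5 characters plus the terminator");

// Phase 7 parses and checks the template definition itself.
template <typename T>
T twice(T value) { return value + value; }

// The call below requires the specialization twice<int>, which is what
// phase 8 instantiates.
int doubled() { return twice(21); }
```

Phase 9 would then resolve any references to functions and objects defined in other translation units, which is the part usually handed to the linker.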
(C's translation phases are similar, but #8 is omitted.)