What are the main steps behind compiling a C program? By compiling, I mean (maybe wrongly) getting a binary from a plain text file containing C code, using gcc.
I would love to understand some key points of the process:
By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?
Is gcc converting any C to assembly language?
I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?
Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?
By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?
You are not very clear here. If you are asking which tools have knowledge of your CPU-specific instructions, it's the compiler back end, the assembler, the disassembler, the debugger, and maybe some others. They can generate machine code or convert it back into assembly (disassembly).
If you are asking who cares about which instructions are used, it's the processor that needs to execute them, as each instruction set encodes even such a common instruction as "add two integers" in a completely different manner.
Is gcc converting any C to assembly language?
Yes, C (or a program in any other supported language) is converted to assembly by GCC. There are many steps involved, and at least two additional internal representations (GIMPLE and RTL) are used in the process. Details are explained in the GCC Internals manual. Finally, the compiler back end generates assembly from the simple "patterns" produced by the previous compiler passes. You can ask GCC to output this assembly by using the -S flag. If you don't specifically ask for it, the next steps (assembling and linking) are executed automatically and you only see your final executable file.
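A minimal sketch for trying this out. The file name add.c and the function are made up for illustration; -S and -c are GCC's documented options for stopping after compilation and after assembly. The contents of the generated add.s will differ between architectures and optimization levels.

```c
/* add.c: a hypothetical one-function file for inspecting GCC's output.
 *
 *   gcc -S add.c    stops after compilation proper and writes add.s (assembly)
 *   gcc -c add.c    also runs the assembler and writes add.o (object code)
 */
int add(int a, int b)
{
    return a + b;
}
```

Opening add.s in a text editor shows the "add two integers" instruction spelled the way your particular target spells it.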
I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?
First, take note that the assembly languages for different CPUs differ, as each is supposed to represent its CPU's machine language roughly 1:1. The assembler then translates assembly code into machine code. Who ships it? Anyone who builds it. With the GNU toolchain it's part of the binutils package, and it's usually installed by default on most Linux distributions. It is not the only assembler available. Also note that although the GNU "suite" (GCC/binutils/gdb) supports many architectures, you need to use a build that targets your architecture. Your desktop PC's default assembler, for example, cannot assemble into ARM machine code.
Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?
Because a text editor is supposed to show the text representation of those 0s and 1s. Assuming each character in the file takes 8 bits, it interprets each subsequent group of 8 bits as a single character instead of showing the separate bits. If you know that in ASCII the letter 'A' is represented by the value 65, you can also convert this back to binary: 01000001. It's a bit easier to convert a hexadecimal representation back to binary. For this you can use hexdump (or a similar tool).
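To make the 'A' example concrete, here is a small sketch that turns one byte into the eight 0/1 characters a text editor does not show you. The function name byte_to_bits is made up for this example.

```c
/* Write the 8 bits of `byte` into `out` as '0'/'1' characters,
 * most significant bit first. `out` must have room for 9 chars
 * (8 digits plus the terminating '\0'). */
void byte_to_bits(unsigned char byte, char out[9])
{
    for (int i = 0; i < 8; i++) {
        /* 0x80 >> i selects bit 7, then bit 6, ... down to bit 0. */
        out[i] = (byte & (0x80u >> i)) ? '1' : '0';
    }
    out[8] = '\0';
}
```

Calling byte_to_bits(65, buf) leaves the string 01000001 in buf, matching the value given above for 'A'.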
Lots happens :)
Here are some of the key steps (BTW, these are how I think of compilation, the following steps only have a passing resemblance to the steps defined in the standard).
The preprocessor runs on the source file.
The pre-processor does all sort of things for us, including:
- It performs trigraph replacement (trigraphs are special three-character sequences that stand for symbols that some early keyboards didn't have).
- It performs macro replacement (i.e. #define) by simple textual replacement.
- It grabs any header files and copies their entire contents to where the #include line was.
With GCC, the program that does this is cpp (in current GCC versions the preprocessor is built into cc1), and using gcc you can stop after this step by using the -E flag.
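A small sketch of what "simple textual replacement" means in practice. DOUBLE is a made-up macro, deliberately written without parentheses so the effect is visible; running gcc -E on a file like this shows the expanded text.

```c
/* DOUBLE is a hypothetical macro, deliberately written without parentheses. */
#define DOUBLE(x) x + x

int macro_demo(void)
{
    /* After preprocessing, the line below reads:  return 3 + 3 * 2;
     * The preprocessor pasted in the text; it did not compute (3 + 3) first. */
    return DOUBLE(3) * 2;
}
```

Because multiplication binds tighter than addition, macro_demo() returns 9 rather than 12, which is why macros are usually written with parentheses around the body and the arguments.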
After the preprocessor runs, we have a file that contains all the information that is necessary for the compiler proper to parse the code, check our syntax, and emit assembly. With GCC, the program that most likely does this is cc1, and using gcc you can stop after this step by using the -S flag (capital S; lowercase -s means something else).
The assembly is converted into object code by, most likely, the program as (the GNU Assembler, also known as gas), and using gcc you can stop after this step by using the -c flag.
Finally, one or more object files, along with libraries, are converted into an executable by the linker. The linker under Linux is normally ld, and using gcc without any special flags runs all the way through this step.
Since you specifically mentioned 'By the end of the day I need to transform my C code to a language that specifically my CPU should understand,' I'll explain a little about how compilers work.
Typical compilers do a few things.
First, they do something called lexing. This step takes individual characters and combines them into 'tokens' which are things the next step understands. This step differentiates between language keywords (like 'for' and 'if' in C), operators (like '+'), constants (like integers and string literals), and other stuff. What exactly it differentiates depends on the language itself.
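As a rough sketch of the idea (not how GCC's lexer is written), the following splits a string into identifiers, numbers, and single-character punctuation. All names here (Token, next_token, count_tokens) are made up for the example, and keywords such as 'for' are simply reported as identifiers; a real C lexer also handles multi-character operators, string literals, comments, and much more.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the token in the source */
    size_t length;      /* number of characters in the token */
} Token;

/* Read one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) p++;   /* whitespace separates tokens */

    Token tok = { TOK_END, p, 0 };
    if (*p == '\0') {
        /* End of input: keep TOK_END with length 0. */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) p++;
    } else {
        tok.kind = TOK_PUNCT;                  /* one character, e.g. '+' or ';' */
        p++;
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}

/* Count the tokens in a string, not including the end marker. */
size_t count_tokens(const char *source)
{
    size_t n = 0;
    while (next_token(&source).kind != TOK_END) n++;
    return n;
}
```

For the input "x = 42 + y;" this yields six tokens: x, =, 42, +, y and ;.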
The next step is the parser, which takes the stream of tokens produced by the lexer and (commonly) converts it into something called an "Abstract Syntax Tree," or AST. The AST represents the computations done by the program with data structures that the compiler can navigate. Commonly the AST is language-independent, and compilers like GCC can parse different languages into a common AST format that the next step (the code generator) can understand.
Finally, the code-generator goes through the AST and outputs code that represents the semantics of the AST, that is, code that actually performs the computations that the AST represents.
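Here is a toy version of that last step, assuming a made-up AST with only numbers, addition, and multiplication, and a made-up stack-machine "assembly" (push, add, mul). None of these names come from GCC; the point is only to show a code generator walking a tree and emitting instructions.

```c
#include <stdio.h>
#include <stddef.h>

typedef enum { NODE_NUMBER, NODE_ADD, NODE_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                 /* used by NODE_NUMBER */
    const struct Node *left;   /* used by NODE_ADD and NODE_MUL */
    const struct Node *right;
} Node;

/* Append one line of output to `out`. `cap` is the buffer size and
 * `*len` the number of characters used so far; output that does not
 * fit is truncated. */
static void append(char *out, size_t cap, size_t *len, const char *line)
{
    int written = snprintf(out + *len, cap - *len, "%s\n", line);
    if (written < 0) return;
    if ((size_t)written >= cap - *len) *len = cap - 1;  /* buffer is full */
    else *len += (size_t)written;
}

/* Post-order walk: emit code for both operands, then the operator. */
void gen(const Node *n, char *out, size_t cap, size_t *len)
{
    if (n->kind == NODE_NUMBER) {
        char line[32];
        snprintf(line, sizeof line, "push %d", n->value);
        append(out, cap, len, line);
        return;
    }
    gen(n->left, out, cap, len);
    gen(n->right, out, cap, len);
    append(out, cap, len, n->kind == NODE_ADD ? "add" : "mul");
}
```

For the tree of 1 + 2 * 3 (an ADD node whose right child is a MUL node), gen emits push 1, push 2, push 3, mul, add, one instruction per line. The caller passes a buffer with cap greater than zero and *len initialised to 0.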
In the case of GCC, and probably other compilers, the compiler does not actually produce machine code. Instead, it outputs assembly code that it passes to an assembler. The assembler goes through a similar process of lexing, parsing, and code-generating to actually produce machine-code. After all, an assembler is just a compiler that compiles assembly code.
In the case of C (and many other languages), the assembler is commonly not the final step. The assembler produces things called object files, which contain unresolved references to functions in other object files or libraries (like printf in the C standard library, or functions from other C files in your project). These object files are passed to something called a 'linker', whose job is to combine all of the object files into a single binary and resolve all of the unresolved references in the object files.
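A sketch of what an unresolved reference looks like from the C side. The functions square and sum_of_squares are made up; in a real project the two halves would sit in different source files, and they are kept in one block here only so the example is self-contained.

```c
/* In a real project this declaration would come from a header, and the
 * definition of square() would live in a different source file. */
int square(int x);

/* Compiling only this function with `gcc -c` would leave the call to
 * square() as an unresolved reference in the resulting object file. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}

/* Stand-in for the "other" file. The linker's job is to match the
 * reference above to a definition like this one. */
int square(int x)
{
    return x * x;
}
```

If the definition were missing from every object file and library given to the linker, the build would fail at link time with an "undefined reference" error rather than at compile time.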
Finally, after all of these steps, you have a complete executable binary.
Note that this is the way that GCC and many, many other compilers work, but it's not necessarily the case. Any program that you could write that accurately accepts a stream of C code and outputs a stream of some other code (assembly, machine code, even JavaScript) that is equivalent, is a compiler.
Also, the steps are not always completely separate. Rather than lexing an entire file, then parsing the entire result, then generating code for the entire AST, a compiler may do a bit of lexing, then start parsing when it has some tokens, then go back to lexing when the parser needs more tokens. When the parser feels it knows enough, it might do some code generation before having the lexer produce some more tokens for it.
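The interleaving can be sketched with a tiny recursive-descent parser for expressions such as 1 + 2 * 3. This is an illustration under heavy assumptions: all names are made up, the input must be well formed and small enough not to overflow an int, and computing the value immediately stands in for "code generation". The parser only reads the next characters from the input at the moment it needs them.

```c
#include <ctype.h>

/* Parser state: just a cursor into the source text. The "lexer" is the
 * small amount of code that skips spaces and reads digits on demand. */
typedef struct { const char *p; } Parser;

static void skip_spaces(Parser *ps)
{
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* number := digit+ */
static int parse_number(Parser *ps)
{
    int value = 0;
    skip_spaces(ps);
    while (isdigit((unsigned char)*ps->p)) {
        value = value * 10 + (*ps->p - '0');
        ps->p++;
    }
    return value;
}

/* term := number ('*' number)* */
static int parse_term(Parser *ps)
{
    int value = parse_number(ps);
    for (;;) {
        skip_spaces(ps);
        if (*ps->p != '*') return value;
        ps->p++;                       /* consume '*' */
        value *= parse_number(ps);     /* result is produced right away */
    }
}

/* expr := term ('+' term)* */
static int parse_expr(Parser *ps)
{
    int value = parse_term(ps);
    for (;;) {
        skip_spaces(ps);
        if (*ps->p != '+') return value;
        ps->p++;                       /* consume '+' */
        value += parse_term(ps);
    }
}

/* Parse and evaluate a whole expression string. */
int evaluate(const char *source)
{
    Parser ps = { source };
    return parse_expr(&ps);
}
```

evaluate("1 + 2 * 3") gives 7: the multiplication is finished inside parse_term before parse_expr ever looks at the rest of the input.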
” By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?
The CPU.
But note that on a modern computer the apparently single CPU is just an illusion.
It's a good enough conceptual model for simple C programming, though.
” Is gcc converting any C to assembly language?
If you ask it to. Option -S will generate an assembly listing. For the PC you can choose between AT&T syntax, which is ugly as sin, peppered with percent signs, and the ordinary Intel syntax. Unfortunately AT&T (selectable via -masm=att) is the default, but you can use -masm=intel to get ordinary assembly.
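If you don't ask it to produce assembly, gcc still generates it internally: the compiler proper writes assembly to a temporary file (or a pipe), and the gcc driver then runs the assembler on it to produce object code.
Producing assembly language as an intermediate form adds some complexity and inefficiency, and some other compilers emit object code directly, but GCC does go through assembly. You can see the intermediate files by passing -save-temps.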
” I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?
You don't need to run such an assembler yourself; the gcc driver invokes it for you. The GNU assembler, as, is part of the binutils package, which is normally installed alongside gcc. Unix-like OSes typically have gcc and as bundled or readily available, while Windows does not have developer tools bundled. Microsoft's dev tools are, however, free to download, now including the full Visual Studio IDE. Microsoft's assembler is ml.exe, and is known as MASM, the Macro Assembler (as if there were no other macro assemblers).
” Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?
That depends on the text editor, although I don't know of any that can present 0s and 1s; text editors are designed to interpret bytes as text.
You can just write such a text editor if you want it.
Fair warning though: it has no practical use that I can think of.
Finally regarding the question in the title,
” What are the main steps behind compiling?
In practice there are two main steps: compilation and linking. The compilation step is further subdivided into preprocessing and core language compilation, i.e.,
compilation → linking
… is really
(preprocessing → core language compilation) → linking
During preprocessing, source code files are combined via #include directives. This produces a full translation unit of source code. The core language compilation translates that to an object code file, which contains machine code with some unresolved references.
Then finally the linking step combines object code files (including object code file contents in libraries) to create a single complete executable.