I'm a novice programmer. I just wanted to see the output at the different phases: compilation, assembling, and linking. I don't know assembly language either.
I wrote a simple program:
#include <stdio.h>
int humans = 9;
int populate(int crappyVariable); /* prototype, so main can call it before the definition */
int main()
{
    int lions = 2;
    int cubs = populate(lions);
    return 0;
}
int populate(int crappyVariable)
{
    return ++crappyVariable;
}
I used gcc -S sample.c
I was surprised by the assembly output: I lost all the variable names and function names.
It preserved the global identifiers like humans, populate, and main, but it prefixed them with an underscore (_), so I don't count that as keeping the identifiers. Anyway, the point is that it lost the identifiers.
My question is how would it call functions or refer to variables?
I'm really curious about further stages of output, which would be in binary (which is not viewable).
What would the output look like just after assembling and before linking? I guess it will lose even the underscore-prefixed global identifiers? Then the question again is: how would it call functions or refer to variables for operations?
I searched the internet but couldn't find anything useful. Maybe I'm not sure what to search for. I don't want to read big books on this, but if there are any articles or tutorials that clear up the concepts, that would also be helpful.
I'm a novice programmer, so it would be great if you could explain in simple but technical terms.
EDIT: In response to the comment, I broke my question into multiple questions. Here is the 2nd part of this question: not clear with the job of the linker
At the basic machine level, there are no more names, just numeric addresses for variables and code. Thus, once your code is translated to machine language, the names are gone for practical purposes.
If you compile with a "to assembler" option or disassemble code, you may see some identifiers; they're there to help you find your way around the code, as you're not expected to be computing data/code offsets in your head unnecessarily.
To answer your question about linking and such: Labels and identifiers that are only used "inside" a C program file are gone once the program is compiled to relocatable object form. However, externally defined names, such as main(), are needed because external modules will reference them; so a compiled object file will contain a little table listing the externally visible names and which location they refer to. A linker can then patch together external references into your module from others (and vice versa) based on those names.
After linking, even the externally defined names aren't needed any more. If you compile with debug options, tables of names may still be attached to the final program, though, so you can use those names when debugging your program.
You really need to read up on compilers and compiler design.
Start with http://www.freetechbooks.com/compiler-design-and-construction-f14.html
Here's the summary.
The goal is to get stuff copied into memory that will execute and run. Then the OS hands control over to that stuff.
The "loader" copies stuff into memory from various files. These files are actually a kind of language describing where stuff goes in memory and what goes in those places. It's a kind of "load memory" language.
The job of compiler and linker is to create files that will make the loader do the right thing.
The compiler's output is "object" files -- essentially loader instructions in many small fragmented files with many external references. The compiler's output is ideally some machine code with place-holders for external references to be plugged in. All the internal references have been resolved to offsets into static memory or stack frames, or to function names.
The linker's output is larger loader files with fewer external references. It's largely the same as the compiler's output in format. But it has more stuff folded in.
Read this on the ld command: http://linux.about.com/library/cmd/blcmdl1_ld.htm
Read this on the nm command: http://linux.about.com/library/cmd/blcmdl1_nm.htm
Here are some details.
"...how would it call functions or refer to variables?"
The function names, generally, are preserved until the later stages of producing output.
The variable names are transformed into something else. "Global" variables are allocated statically and the compiler has a map from variable name to type to offset into the static data area (this is separate from the heap, which is used for run-time allocation).
Local variables within a function are (usually) allocated in the stack frame. The compiler has a map from variable name to type to offset into the stack frame. When the function is entered, a stack frame of the required size is allocated and the variables are simply offsets into that frame.
"...how would it call functions or refer to variables for operations?"
You have to provide a hint to the compiler. The extern keyword tells the compiler that a name is not defined in this module, but is defined in another module and the reference must be resolved at link (or load) time.
"...if there is nothing to link..."
This is never true. Your program is only one piece of the overall executable. Most C libraries include the real main program which then calls your function named "main".
"will the linker change the object code output of assembler?"
This varies a lot with the OS. In many OSes, linking and loading happen together. What often happens is that the output from the C compiler is thrown into an archive without having really had much resolution performed.
When the executable is loaded into memory, the archive references and any external shared object files are loaded, also.
"The program is not running, its just in the manufacturing stage."
This doesn't mean anything. Not sure why you're including this.
"How could linker map to memory? How would it look like?"
The OS will allocate a block of memory into which the executable program must be copied. The linker/loader reads the object file, any object archive files, and copies the stuff in those files into that memory. The linker does the copying and name resolution and writes a new object file that's more complete. The loader does it into real memory and turns over execution to the resulting text page.
"Its at the run time right?"
That's the only way to debug -- run time. It can't mean anything else, or it's not debugging.
To see how local variables are handled in the assembly code, compile something like:
int main() { int foo = 42; }
What you'll notice is not just that the variable name disappears, but where the resulting data goes. You'll see something like:
movq %rsp, %rbp
Which sets the base pointer to the current stack pointer. Then:
movl $42, -4(%rbp)
So what this tells us is that the compiler allocates some space on the stack but leaves it unnamed. Adding more variables like foo will basically just allocate more memory under the base pointer. The variable that was "foo" is now just -4(%rbp).
A good next step is to run objdump -D on the generated .o and compare it with the .s version.
That gives an idea of which parts of the .s file are markup (directives and labels) and which are translated into binary.
The final stage is linking, which roughly means two passes to resolve all labels between multiple .o files to addresses relative to 0 or a load address.
See the great free Linkers and Loaders book http://www.iecc.com/linker/ for more info on linking