What is the actual relation between assembly, machine code, bytecode, and opcode?
I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.
The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.
The main things I am looking for are:
- a snippet of assembly code
- a snippet of machine code
- a mapping between the snippet of assembly and machine code (how to do that mapping, or at least some general examples, and how do you know how to do this, where is all this information on the web)
- how to interpret the machine code (like are opcodes somehow related, and where is all the information on the web about what all those numbers mean)
Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.
Relation Between Assembly and Machine Code
My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.
So when you compile some assembly such as this example.asm:
global main
section .text
main:
    call write
write:
    mov rax, 0x2000004
    mov rdi, 1
    mov rsi, message
    mov rdx, length
    syscall
section .data
message: db 'Hello, world!', 0xa
length: equ $ - message
(compile it with nasm -f macho64 -o example.o example.asm). It outputs this example.o object file:
cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800
(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:
cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this
But how did it go from assembly instructions, to those numbers? Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on (I am using x86-64 through NASM on OSX), and how each set of numbers maps to each assembly instruction?
I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.
Relation Between Assembly and Bytecode (or is it called "opcode"?)
So from my reading so far, assembly gets transformed into machine code as demonstrated above.
But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:
void myfunc(int a) { printf("%s", a); }
The assembly for this function would look like this:
OP  Params  OpName      Description
13  82 6a   PushString  82 means string, 6a is the address of "%s". So this function pushes a pointer to "%s" on the stack.
13  83 00   PushInt     83 means integer, 00 means the one on the top of the stack. So this function gets the integer at the top of the stack, and pushes it on the stack again.
17  13 88   Call        1388 is printf, so this calls the printf function.
03  02      Pop         This pops the two things we pushed back off the stack.
02          Return      This returns to the calling code.
So then I get confused. Doing some digging, I can't tell if each of those 2-digit hex numbers like 13 82 6a are each, individually, called "opcodes", and the whole set of them is called "bytecode" as a catch-all term. In addition, I can't find a table that lists out all of these 2-digit hex numbers, and what their relation is to machine code, or assembly.
To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works). I think this would clarify it for not only myself but for many people down the road who are interested in learning more about the bare metal.
One other reason why this would be valuable to know is, so one can understand how the LLVM compiler generates machine code. Do they have some sort of "complete list" of 2-digit opcodes or machine code 4-digit sequences, and know exactly how that maps to any architecture-specific assembly? Where did they get that information from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.
Update
Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?
Yes, each architecture has an instruction set reference that gives how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z
Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l gives a listing where you can see the generated machine code in the third column (first is line number, second is address).
Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.
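To make the "more than just the machine code" point concrete, here is a rough Python sketch that reads the first 32 bytes of the example.o dump from the question as a 64-bit Mach-O header. The field names follow the mach_header_64 layout; the function name and the HEADER_HEX constant are just names chosen for this illustration.

```python
import struct

# First 32 bytes of example.o, copied from the hex dump in the question.
HEADER_HEX = (
    "cffa edfe 0700 0001 0300 0000 0100 0000 "
    "0200 0000 0001 0000 0000 0000 0000 0000"
)

# Field order of a 64-bit Mach-O header: eight little-endian 32-bit words.
FIELDS = ("magic", "cputype", "cpusubtype", "filetype",
          "ncmds", "sizeofcmds", "flags", "reserved")


def parse_mach_header_64(raw: bytes) -> dict:
    """Unpack the eight header words into a name -> value mapping."""
    values = struct.unpack("<8I", raw[:32])
    return dict(zip(FIELDS, values))


header = parse_mach_header_64(bytes.fromhex(HEADER_HEX))
for name, value in header.items():
    print(f"{name:<11} 0x{value:08x}")
```

The magic comes out as 0xfeedfacf (64-bit Mach-O), the file type as 1 (a relocatable object file), and there are 2 load commands. None of that is CPU instructions; the actual code only starts much later in the file.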
Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode, 04000002 is the immediate operand (the 32-bit value 0x02000004, stored least significant byte first).
Bytecode is not typically used in the assembly context; it could be thought of as the machine code for a virtual machine.
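As an illustration of the opcode/operand split, here is a minimal Python sketch that decodes the 32 bytes of the text section found in the question's example.o dump. It only knows the four instruction forms that occur there (E8 rel32, B8+r imm32, REX.W B8+r imm64, and 0F 05); it is nowhere near a real x86 decoder, and the function and table names are made up for this example.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

# Text section bytes, copied from the example.o hex dump in the question.
TEXT = bytes.fromhex(
    "e800000000" "b804000002" "bf01000000"
    "48be0000000000000000" "ba0e000000" "0f05"
)


def decode(code: bytes) -> list:
    """Decode only the handful of instruction forms used in the example."""
    out, i = [], 0
    while i < len(code):
        b = code[i]
        if b == 0xE8:  # call rel32: opcode byte + signed 32-bit displacement
            rel = int.from_bytes(code[i + 1:i + 5], "little", signed=True)
            out.append(f"call {rel:+d}")
            i += 5
        elif 0xB8 <= b <= 0xBF:  # mov r32, imm32: register is opcode - 0xB8
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov {REG32[b - 0xB8]}, 0x{imm:x}")
            i += 5
        elif b == 0x48 and i + 1 < len(code) and 0xB8 <= code[i + 1] <= 0xBF:
            # REX.W prefix turns the same opcode into mov r64, imm64
            imm = int.from_bytes(code[i + 2:i + 10], "little")
            out.append(f"mov {REG64[code[i + 1] - 0xB8]}, 0x{imm:x}")
            i += 10
        elif code[i:i + 2] == b"\x0f\x05":  # two-byte opcode, no operands
            out.append("syscall")
            i += 2
        else:
            raise ValueError(f"unhandled byte 0x{b:02x} at offset {i}")
    return out


for line in decode(TEXT):
    print(line)
```

Two things show up in the result. The bytes for mov rax, 0x2000004 are the 5-byte 32-bit form (B8 is eax), and the call displacement and the address of message are still zero in the object file, because those get filled in later through relocations.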
For a walkthrough, x86 is a very complicated architecture. But your sample code happens to have a simple instruction, the syscall. So let's see how to turn that into machine code. Open the above mentioned reference pdf, and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done, those 2 bytes are the machine code. How do we turn it back? Go to Appendix A: Opcode map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." Okay, so we skip the 0F and split the 05 into 0 and 5 and look that up in table A-3 in row #0, column #5. We find it is a syscall instruction.
Briefly:
"Assembly" is what you feed through an "assembler". An assembler is a program which reads in several decks of punched cards and "assembles" them into a single program.
Or at least it used to be. Now the cards are replaced with disk files. But the data on the "cards" is a "machine language" which is the numeric values for the machine instructions.
But modern assemblers are SAPs -- Symbolic Assembler Programs -- so you can replace the numeric values with symbols -- say "LOD" for a Load instruction, "R1" for register 1, and "label5" for the instruction address 26734.
"Machine language" is the way that individual instructions (or "orders", if you're a Brit) to the CPU are represented. For a symbolic assembler you might have "LOD R1, LOOPCOUNT" to represent the instruction to load the value at the word labeled LOOPCOUNT into register 1. "LOD", by the way, is the "opcode" -- the (symbolic version of the) numeric value that tells the computer what to do next. (And note that every different computer design uses a different machine language, possibly with different symbols for the opcodes. Most of what you will find on the web is one version or another of the Intel machine language, but you would find, say, the IBM 370 to be radically different.)
"Bytecode" is a different sort of "machine language" which operates on a "virtual machine" instead of real hardware. The best known case of this is the Java Virtual Machine. "Bytecode" is a notation similar to regular "machine language" but idealized to an extent, since running on a virtual machine relieves it from some of the realities of a real hardware environment.
The relationship is:
The assembler instruction is human readable code, such as:
mov rax, 0x2000004
The opcode is the part of the machine code that relates to the instruction, but from the CPU point of view (so it's not just MOV, but MOV constant to register). For example, see here for i386 MOV opcodes:
MOV reg32, immediate value is coded as B8 + register code (AX is the first one, so its code is 0), followed by the immediate value stored low byte first: 04 00 00 02
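Going in the assembler's direction, the rule above ("B8 + register code, then the immediate") can be sketched in a few lines of Python. The register numbering is the standard x86 one; the function name is invented for this example, and only this single instruction form is covered.

```python
# Standard x86 register numbers for the 32-bit general purpose registers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_mov_r32_imm32(reg: str, imm: int) -> bytes:
    """Encode `mov reg, imm` as opcode B8+reg followed by a 4-byte immediate."""
    if not 0 <= imm <= 0xFFFFFFFF:
        raise ValueError("immediate does not fit in 32 bits")
    opcode = 0xB8 + REG32[reg]
    # x86 is little-endian, so the low byte of the immediate comes first.
    return bytes([opcode]) + imm.to_bytes(4, "little")


encoded = encode_mov_r32_imm32("eax", 0x2000004)
print(encoded.hex())  # b804000002
```

The same function gives bf01000000 for edi with 1 and ba0e000000 for edx with 14, which are the byte sequences that appear in the question's object file.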
Byte-code is the equivalent of machine code but for virtual machines such as the JVM. The term bytecode comes from the first environments that used this technology (p-code from the UCSD Pascal compiler), which used a byte to encode the virtual instruction. You can find for example the small p-code instruction set here, and the more recent and extensive JVM bytecode here.
To be noted: LLVM uses an intermediate representation (IR) that can be stored in a compact binary form called bitcode (early LLVM releases called it bytecode). This makes it possible to perform machine-neutral code analysis and optimization before generating native code.
You have clearly done some homework of your own on this, and I say good stuff (and voted you up one).
As you are experiencing, the more you read, the more you say, "huh ?"
Okay, first off, when you encounter the word "bytecode" just close the window and stop reading, because you are on the wrong path; probably a tangent at best and at worst you could be reading someone trying to sound smarter than he really is by tossing techy sounding buzzwords into his writing.
Now, as for the word "opcode", yes those really do exist, but do understand that those numbers are actually symbolic, for humans to grasp conceptually. In real life, they are super-ultra-tiny switches.
If you really like history, and technology before the internet (or color TV for that matter) look up phrases like butterfly switches, vacuum tubes, butterfly girls, and I forget the other words. This was back before transistors existed. The original huge computers actually used vacuum tubes and generated enough heat to warm an entire floor (or two or three) of an office building in the dead of Winter. The electrical current draws were astounding.
The thing to keep in your mind about all this is that those computers were "programmed" by individually flipping butterfly switches ("bat handles" were another term sometimes used) which connected and disconnected individual lines from individual tubes, and I forget what else.
The facts were: You programmed a computer by flipping the bat handles that were connected to the lines that were connected to various tubes.
Fast Forward To Today...
When you write an opcode of 90h, (I think that's a NOP in x86, somebody correct me and I'll fix it) you are doing (with today's hi-tech wowee-zowee) the same thing that the butterfly girls did back in the stone age of computers.
Specifically, you are "throwing" these "butterfly switches"...
Here's the big difference (and part of today's hi-tech wowee-zowee)...
They had to throw exactly those switches at exactly one place on the floor. You will be flipping them anywhere you want. Three other programs will cooperate and make those decisions for you.
Those three programs are:
- The Assembler
- The Linker
- The Loader
So then (I hope) that this has helped lay the foundation for you to understand that the OPCODE is a mental representation of a bunch of little switches that will be "opened" or "closed".
(Actually, the hi-tech wowee-zowee has taken it a step further, but it's the same effect as the butterfly switches of previous generations.)
Anyway, it works like this.
Humans decided that there would be an instruction to do nothing, called a NOP.
So, you type the letters NOP in your text editor.
You then save the file.
You then ask the assembler to assemble that file.
When the assembler sees the NOP he creates the 90 (in hex) in the Object file which he is creating for the linker.
The Linker uses the object file and creates an executable file.
The Loader places that executable file wherever it wants. (Note, in olden days of microcomputers, the software writer had to decide where to place that executable file; that was conflict bait like you wouldn't believe.)
Anyway, the NOP became 90 in some place in the EXE file and the loader stuck it in a good area for you, based on 179 rules you don't have to worry about any longer.
The loader then gets out of the picture and lets your program have the CPU.
The CPU fetches your first instruction and starts obeying.
When the CPU gets to the byte containing 90 it will be the same thing as the butterfly switches from generations past.
While the current will not be traveling a bunch of long wires on the floor, it will be doing highly similar (and functionally equivalent) things inside the ASIC.
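The "assembler sees NOP and writes 90" step can be sketched as a lookup table. This is only a toy in Python that handles four x86 instructions which take no operands; the table and function names are made up here, and a real assembler also deals with operands, labels, sections and the object file format.

```python
# Mnemonic -> machine code bytes for a few operand-less x86 instructions.
ENCODINGS = {
    "nop": b"\x90",
    "ret": b"\xc3",
    "hlt": b"\xf4",
    "syscall": b"\x0f\x05",
}


def assemble(source: str) -> bytes:
    """Translate one mnemonic per line into bytes; ';' starts a comment."""
    out = bytearray()
    for line in source.splitlines():
        mnemonic = line.split(";", 1)[0].strip().lower()
        if not mnemonic:
            continue  # blank line or comment-only line
        out += ENCODINGS[mnemonic]  # KeyError for anything not in the table
    return bytes(out)


machine_code = assemble("NOP\nnop ; do nothing again\nret")
print(machine_code.hex())  # 9090c3
```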
Now with all that written (thanks if you're still actually reading) you can understand this boiled down one line explanation of what an opcode actually is...
The opcode is a paradigmatic representation of butterfly switches of olden days.
Now for your second question about what is machine code.
Machine code is a bunch of opcodes (plus whatever operand bytes each one takes).
If any of this is unclear, ask in the comments section and I'll try to edit this answer.
Yes, though they can be very complex. Also, due to the prevalence of assemblers and compilers, they're also sort of hard to find, because pretty much nobody uses them.
13 tells the processor to push a string onto the stack.
13.
PushString maps to machine instruction 13.
I should note that the bytecode instructions used in this post and in my other post that you linked to are simplified extracts from a proprietary byte code I work with at my company. We have a proprietary programming language that compiles to this bytecode which is interpreted by our product, and some of the values I mentioned are real bytecodes we actually use.
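For anyone wondering how a reader of such bytes knows where one instruction ends and the next begins, here is a small Python sketch of a disassembler for the simplified bytecode quoted in the question. The operand counts are inferred from that table only; the real proprietary format is not public, so treat the tables below as assumptions.

```python
# Opcode -> (name, number of parameter bytes), inferred from the quoted table.
OPCODES = {
    0x13: ("Push", 2),
    0x17: ("Call", 2),
    0x03: ("Pop", 1),
    0x02: ("Return", 0),
}
# First parameter of a push selects what kind of value is pushed.
PUSH_KINDS = {0x82: "PushString", 0x83: "PushInt"}


def disassemble(code: bytes) -> list:
    """Split a byte string into (name, parameter bytes) pairs."""
    out, pc = [], 0
    while pc < len(code):
        name, count = OPCODES[code[pc]]
        params = list(code[pc + 1:pc + 1 + count])
        if name == "Push":
            name = PUSH_KINDS.get(params[0], name)
        out.append((name, params))
        pc += 1 + count  # the opcode decides how far to advance
    return out


program = bytes.fromhex("13826a 138300 171388 0302 02")
for name, params in disassemble(program):
    print(name, [f"{p:02x}" for p in params])
```

Note that the byte 02 shows up twice near the end: once as the parameter of Pop and once as the Return opcode. Whether a byte is an opcode or a parameter depends purely on its position in the stream.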
13 is actually pushAnything with complex parameters, but I kept things simple for the answer.
Assembly: Human readable instructions to the assembler + data bytes + operators
Machine code: The actual bit sequences that the CPU understands.
It contains:
Bytecode: This is the code read by an interpreter (most implementations of Java are actually an interpreter that reads bytecode and uses that bytecode to select a sequence of machine code to have the CPU actually execute). Bytecode is often used to make the same source code work on several different CPUs.
Opcode: The first one (or two) bytes of the machine code. It acts like a selector to tell the CPU which microcode sequence it is to perform (something like a switch statement in C)
Microcode: The hardwired instruction sequences within the CPU that are used to execute the machine code.
There are lots of microcode sequences, at least one sequence for each opcode. In general, the rest of the machine code is just parameters to the microcode sequence that is selected by the opcode. Each microcode sequence contains instructions to open/close gates, clock data, pass info to/from the accumulator, etc.
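The "opcode as a selector, like a switch statement" idea is easiest to see in a bytecode interpreter. Below is a minimal Python sketch of a stack-based virtual machine; the four opcodes and their numeric values are invented for this example and have nothing to do with JVM bytecode or any real CPU.

```python
# Hypothetical opcodes for a made-up virtual machine.
PUSH, ADD, MUL, HALT = 0x01, 0x02, 0x03, 0xFF


def run(bytecode: bytes) -> int:
    """Fetch an opcode, select the matching action, repeat until HALT."""
    stack, pc = [], 0
    while True:
        op = bytecode[pc]
        if op == PUSH:  # one parameter byte: the value to push
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            pc += 1
        elif op == HALT:  # result is whatever is on top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode 0x{op:02x} at offset {pc}")


# (2 + 3) * 4
program = bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])
print(run(program))  # 20
```

The if/elif chain plays the role of the switch statement: the opcode picks the branch, and any bytes after it are just parameters for that branch.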