Question:
This question might seem odd, but I am still trying to grasp the concepts of virtual machines. I have read several answers, but I still don't get whether Java bytecode (and MSIL as well) is the same as assembly language. As far as I understand, both bytecode and assembly get compiled to machine code, so in terms of abstraction they are at the same level, i.e. one step above machine code. So is bytecode just an assembly language, i.e. a human-readable form of machine code? If yes, then why is assembly language still used? Why not program in bytecode (which is portable across different machines) instead of assembly language (which is specific to a single machine architecture)? Thanks
Answer 1:
No.
Java bytecode is a binary programming language, not a "human-readable form", unless you consider a bunch of numbers readable, or you use a disassembler to turn it back into bytecode text mnemonics, or even back into Java source form.
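To make the "bunch of numbers" versus "text mnemonics" point concrete, here is a minimal sketch. The four opcode values are the ones the JVM specification assigns; the `disassemble` helper and the idea that these four bytes are the whole method body are simplifying assumptions for illustration (a real .class file wraps them in a lot of structure).

```python
# A few single-byte JVM opcodes (values from the JVM specification).
MNEMONICS = {0x1A: "iload_0", 0x1B: "iload_1", 0x60: "iadd", 0xAC: "ireturn"}

def disassemble(code: bytes) -> list[str]:
    """Map each opcode byte to its mnemonic; only operand-less opcodes are handled."""
    return [MNEMONICS[b] for b in code]

# Body one would expect for: static int add(int a, int b) { return a + b; }
code = bytes([0x1A, 0x1B, 0x60, 0xAC])
print(code.hex(" "))       # the "bunch of numbers"
print(disassemble(code))   # the text mnemonics a disassembler gives back
```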
Assembly is usually the text mnemonics of the actual instructions of the target machine, mapped 1:1, so one instruction in the assembler source translates directly into one machine code instruction. (Some exceptions exist with some CPUs and assemblers: for example, many RISC assemblers will translate "load register with immediate value" into multiple instructions as needed, because the native machine code can load only some of the bits at a time and the whole value has to be composed by several instructions.)
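As a rough sketch of that RISC exception, assuming MIPS-style `lui`/`ori` instructions with 16-bit immediates (the function name `expand_li` is made up for illustration, and real assemblers have more cases), a "load immediate" pseudo-instruction might be expanded like this:

```python
def expand_li(reg: str, value: int) -> list[str]:
    """Expand a hypothetical 'li reg, value' pseudo-instruction, MIPS-style."""
    value &= 0xFFFFFFFF              # treat the value as an unsigned 32-bit word
    upper, lower = value >> 16, value & 0xFFFF
    if upper == 0:
        # Fits in 16 bits: one real instruction is enough.
        return [f"ori {reg}, $zero, {lower:#x}"]
    out = [f"lui {reg}, {upper:#x}"]  # load the upper 16 bits, lower bits become 0
    if lower:
        out.append(f"ori {reg}, {reg}, {lower:#x}")  # OR in the lower 16 bits
    return out

print(expand_li("$t0", 0x12345678))
```

So one source line can become one or two machine instructions depending on the value, which is exactly where the 1:1 rule bends.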
Java bytecode is a fairly high-level abstraction compared to the machine code of most CPUs, with very little overlap in instructions and memory model. The only similarity is that bytecode is stored in binary form, just like machine code.
edit:
The JVM is in principle an interpreter, i.e. it translates the bytecode into machine code on the fly. That is the work which, in other languages, is done by the compiler at compile time.
Modern JVMs are not classic pure interpreters, but use a "JIT" (Just In Time) compiler to compile small pieces of Java bytecode into native machine code just ahead of its execution, using caches to avoid compiling already known .class files a second time, and using runtime tracking of performance data to tell the JIT compiler which bytecode should be optimized heavily (code that runs often, or an inner loop) and which should just be compiled ASAP, without focus on performance.
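A toy sketch of the "track what is hot, then compile it once and cache the result" idea. Everything here is an assumption for illustration: the `(op, arg)` program format, the `ToyVM` class, and the threshold are invented, and a Python lambda merely stands in for native machine code. It is nowhere near how a real JVM is built.

```python
HOT_THRESHOLD = 3  # hypothetical value; real JVM thresholds are typically far larger

def interpret(program, x):
    """Slow path: walk the toy (op, arg) bytecode one instruction at a time."""
    for op, arg in program:
        if op == "add":
            x += arg
        elif op == "mul":
            x *= arg
        else:
            raise ValueError(f"unknown op {op!r}")
    return x

def compile_program(program):
    """Stand-in for a JIT: translate the whole program once into a Python function."""
    ops = {"add": "+", "mul": "*"}
    expr = "x"
    for op, arg in program:
        expr = f"({expr} {ops[op]} {arg!r})"
    return eval(f"lambda x: {expr}")

class ToyVM:
    def __init__(self, program):
        self.program = program
        self.calls = 0
        self.compiled = None  # cache: filled in once the program becomes hot

    def run(self, x):
        if self.compiled is not None:
            return self.compiled(x)          # fast path, no re-compilation
        self.calls += 1                      # runtime profiling data
        if self.calls >= HOT_THRESHOLD:
            self.compiled = compile_program(self.program)
        return interpret(self.program, x)

vm = ToyVM([("add", 2), ("mul", 3)])
results = [vm.run(1) for _ in range(5)]      # first 3 runs interpreted, rest compiled
```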
So with a modern JVM it's hard to talk about interpreters; it's quite a sophisticated and complex solution. C# quite often goes one step further, sometimes delivering parts of the binaries pre-compiled into machine code for common platforms (keeping the bytecode form only as a fallback for uncommon platforms).
None of this (or anything similar) happens with machine code. It just executes on the CPU.
Answer 2:
An assembly language is a human-readable text language designed to be assembled into a binary. Each source line maps directly to one chunk of binary output (e.g. one variable-length x86 instruction), without depending on previous lines. (I'm not sure if Java bytecode asm is context-sensitive; I haven't used it).
e.g. mov eax, 1234
assembles to the same 5 bytes regardless of what other source lines surround it. (Ignoring named constants and assembler macros, of course).
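For the curious, those 5 bytes can be reproduced by hand. This is a minimal sketch assuming 32-bit or 64-bit mode, where `mov eax, imm32` is encoded as opcode 0xB8 followed by the immediate in little-endian order; the function name is just for illustration.

```python
import struct

def assemble_mov_eax_imm32(value: int) -> bytes:
    """Encode x86 'mov eax, imm32': opcode 0xB8 plus a little-endian 32-bit immediate."""
    return bytes([0xB8]) + struct.pack("<I", value & 0xFFFFFFFF)

encoded = assemble_mov_eax_imm32(1234)
print(encoded.hex(" "))  # b8 d2 04 00 00
```

The function takes nothing but the operand, which is the point: the output does not depend on any surrounding source lines.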
The default meaning of "assembly language" (the one described in the assembly tag wiki) is CPU machine-code assembly language, where the bytes being assembled into the output file are instructions and data for a native executable for some kind of CPU / microprocessor.
Other kinds of assembly languages exist, like Java bytecode assembly, where the bytes assembled into the output file are in Java .class format and can be run by a JVM. (@Ped7g's answer expands on this point, about how a JVM can optimize while translating Java bytecode into native machine code. This process is definitely not like assembling.)
It's all just a text language that causes the assembler to assemble bytes into the output file.
You could have an assembly language for any kind of binary file format, even non-executable ones. A simple example: an assembly language for a bitmap still-image file format, where you can use named colours (like midnight blue) for each pixel. The assembler would assemble bits (instead of only whole bytes like normal assembly languages) into the output file.
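A minimal sketch of such a bitmap "assembler", assuming an invented format with 2 bits per pixel and no header. The palette names, their codes, and the MSB-first packing are all made up for this example.

```python
# Hypothetical 2-bit palette; the names and codes are invented for this sketch.
PALETTE = {"black": 0b00, "midnight_blue": 0b01, "white": 0b10, "red": 0b11}

def assemble_bitmap(source: str) -> bytes:
    """Assemble whitespace-separated colour names into packed 2-bit pixels."""
    bits = 0
    nbits = 0
    out = bytearray()
    for name in source.split():
        bits = (bits << 2) | PALETTE[name]   # append two bits per pixel
        nbits += 2
        if nbits == 8:                       # a full byte is ready
            out.append(bits)
            bits = nbits = 0
    if nbits:                                # pad the last byte with zero bits
        out.append(bits << (8 - nbits))
    return bytes(out)

print(assemble_bitmap("black midnight_blue white red").hex())
```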
In a more complex case, you could imagine an H.264 assembly language, where you use a text syntax to describe the coding of headers and each macroblock.
In this case, you'd design the assembler to do the final CABAC or CAVLC compression of the assembled macroblock data into a bitstream, instead of describing that as part of the assembly language. It would be like an x86 assembler that produced gzipped binaries: assemble into a deflate stream.
One key feature of an assembly language is that it's close enough to the machine-code format that a disassembler can turn a binary back into asm that looks like what was assembled in the first place (but without any comments, label names, or macros, of course).
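Here is a small sketch of that round trip, using an invented 2-bytes-per-instruction format (opcode, then one operand byte). The opcodes, the `;` comment syntax, and the label handling are assumptions for the example, not any real ISA.

```python
OPCODES = {"push": 1, "jmp": 2, "halt": 3}   # invented 2-byte instruction format
NAMES = {v: k for k, v in OPCODES.items()}

def assemble(source: str) -> bytes:
    lines = [ln.split(";")[0].strip() for ln in source.splitlines()]  # drop comments
    lines = [ln for ln in lines if ln]
    labels, insns = {}, []
    for ln in lines:                       # pass 1: record label positions
        if ln.endswith(":"):
            labels[ln[:-1]] = len(insns)
        else:
            insns.append(ln.split())
    out = bytearray()
    for op, *args in insns:                # pass 2: emit opcode + one operand byte
        arg = args[0] if args else "0"
        out += bytes([OPCODES[op], labels[arg] if arg in labels else int(arg)])
    return bytes(out)

def disassemble(code: bytes) -> str:
    pairs = zip(code[0::2], code[1::2])
    return "\n".join(f"{NAMES[op]} {arg}" for op, arg in pairs)

src = "start:  ; entry point\n  push 7\n  jmp start\n  halt\n"
code = assemble(src)
print(disassemble(code))
```

The disassembly reassembles to the same bytes, but the label `start` and the comment are gone: `jmp start` comes back as `jmp 0`.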
This is why C and Java are considered higher level languages than the binary/assembly their compilers produce as output.
Answer 3:
I see your point, and if you squint and make things a little fuzzy then yes, absolutely. Java bytecode, Python's equivalent, Pascal's, and perhaps some others are just machine code definitions, and their compilers compile to that machine code. And that machine code runs natively on a machine. To date those machines are virtual, and likely always will be, so that is the pushback you are getting from other folks here.
Assembly is the human-readable form of machine code, the bits and bytes that are not as easy for us to read but are easy for machines to read.
Java (etc.) machine code is in part just another instruction set, and there are other stack-based instruction sets that are implemented directly in hardware; it is a very simple, generic approach to an instruction set. But these bytecodes have some high-level system calls, way higher than even CISC, and that is where the problem comes in with implementing them in logic. There is no reason to even microcode something like that; the way you approach it is to create a virtual machine using the native instruction set (most likely compiled from a non-Java high-level language).
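To show how simple the stack-based approach is, here is a toy virtual machine, a sketch only. The three instructions and the tuple-based program format are invented for the example and are far smaller than any real bytecode.

```python
def run(program):
    """Execute a toy stack-machine program given as (mnemonic, operand) tuples."""
    stack = []
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack.pop()

# (2 + 3) * 4 with no registers named anywhere: operands live on the stack.
program = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]
print(run(program))  # 20
```

The arithmetic part maps onto logic easily enough; it is the high-level operations (object allocation, method invocation and so on) that have no equally simple counterpart.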
If it were truly just another instruction set, then absolutely, you could create silicon for that. But even if you could, that doesn't mean we should give up all the other instruction sets we have. For starters, Java tools seem to be proud that they don't optimize, so you start off with slow programs and only make them a little faster, when other languages on native hardware are far less costly in resources and energy. The inventors had a desire to do this in hardware, and would have us believe that it happened: ARM Jazelle, and others. In the case of ARM Jazelle, the silicon takes the bytes in logic, but it looks them up in a table which is made up of native machine code. ARM processors that claim Jazelle support are bogus, as you have to buy a software blob from ARM to make it work, and any number of other JVMs are actually faster and more efficient (and sometimes free) than the pay-for ARM Jazelle software. So that was a fail, and it was bogus anyway.
Yes, it is a language that is compiled into a machine code. That machine code is executed on a machine. The difference is that those machines are virtual, not implemented in silicon (just like machine code that is implemented on a microcoded processor is (grin)), and likely won't be.
Answer 4:
Bytecode and assembly language are not the same thing, but they are tightly related.
Bytecode is a simplified binary language, similar to machine code. A bytecode specification describes how the program should be encoded to ensure that the virtual machine will correctly understand and execute it. In the same way, a processor specification describes the so-called Instruction Set Architecture (ISA), which shows how the program should be encoded in binary machine code to ensure that the processor will correctly understand and execute it. So bytecode is a machine-friendly representation of the program in the form of a sequence of bits.
The problem with bytecode is that while it is extremely convenient for machines to handle, it is extremely inconvenient for humans. Assembly language provides a text-based and thus human-friendly equivalent of bytecode. In fact, an assembly language establishes a 1-to-1 mapping between bytecode instructions in binary form and their text equivalents, giving programmers a convenient way to read, understand and write programs in a particular bytecode (for a particular processor or virtual machine). In other words, bytecode and assembly language describe the program at the same level of abstraction but in different terms.
The strict 1-to-1 mapping between bytecode instructions and assembly language statements allows easy and unambiguous conversion of a program from binary form to text form and vice versa. As you may have noticed, there are plenty of disassemblers that let engineers look under the hood of already compiled applications by converting them from the bytecode binary into assembly language text.
The conversion of assembly text into bytecode requires compilation. But in contrast to high-level programming languages, compiling assembly text is very simple. The assembler consumes the program text statement by statement. Usually the assembly language specifies that each statement must be placed on a separate line of the program text, so the assembler consumes that text line by line. From each line it extracts a sequence of words and punctuation characters, ignoring comments, and uses that set of words as a key into a mapping table to find the equivalent sequence of binary bytes representing the same instruction. That sequence of bytes is placed into the bytecode of the program. In fact, Java uses bytecode precisely to eliminate the overhead of text parsing: during JITing it does not compile machine code from assembly text.
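The line-by-line, table-driven process described above can be sketched as follows. The opcode values are taken from the JVM specification, but the four-entry table, the `//` comment syntax, and the lack of any .class file structure are simplifications assumed for this example.

```python
# mnemonic -> (opcode, number of one-byte operands); opcode values from the JVM spec
TABLE = {
    "iconst_1": (0x04, 0),
    "bipush":   (0x10, 1),
    "iadd":     (0x60, 0),
    "ireturn":  (0xAC, 0),
}

def assemble(source: str) -> bytes:
    out = bytearray()
    for lineno, line in enumerate(source.splitlines(), start=1):
        words = line.split("//")[0].split()   # strip comment, split into words
        if not words:
            continue                          # blank or comment-only line
        opcode, n_operands = TABLE[words[0]]  # table lookup keyed by mnemonic
        if len(words) - 1 != n_operands:
            raise SyntaxError(f"line {lineno}: wrong operand count")
        out.append(opcode)
        out += bytes(int(w) & 0xFF for w in words[1:])
    return bytes(out)

print(assemble("iconst_1\nbipush 41 // constant\niadd\nireturn").hex(" "))
```

There is no syntax tree and no optimization pass here: each line is looked up and emitted on its own.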
Also, in contrast to high-level languages, compiling assembly language to bytecode requires neither complex syntax analysis (building an abstract syntax tree) nor semantic analysis, and it does not optimize the produced bytecode. Assemblers are very simple in comparison to modern compilers. And in contrast to high-level programming languages, an assembly language is always tied to a particular bytecode, and thus to a particular processor or virtual machine. High-level languages were initially introduced as a means of making programs portable, and hence they are designed to be general enough. Programs in assembly languages, by contrast, are not portable, but they give programmers full access to all the features of the respective processor or virtual machine, many of which are not accessible from a high-level language.
The idea behind languages such as Java and C# is to preserve the portability of high-level languages while minimizing the overhead of the interpretation/compilation required to execute a program. That is why they employ virtual machines and bytecode.
Note that the same bytecode can be supported by multiple assembly languages, because there can be multiple dictionaries mapping the same bytecode instructions 1-to-1 to different text strings. Each assembly language can provide its own variant of the sequence of words that describes the same instruction in binary form. For example, take a look at the x86 assemblers. Intel uses one notation, Microsoft another, and the GNU assembler yet another. But all of them compile to the same machine code.
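As a small illustration of two notations for one encoding, the sketch below accepts a single instruction in Intel-style syntax (`mov eax, 1234`) and in AT&T-style syntax as used by default in the GNU assembler (`movl $1234, %eax`). The two parser functions are hypothetical and handle only this one instruction with a decimal immediate; the encoding assumes 32-bit or 64-bit mode.

```python
import re
import struct

def encode_mov_eax(imm: int) -> bytes:
    # x86 'mov eax, imm32' is opcode 0xB8 plus a little-endian 32-bit immediate.
    return bytes([0xB8]) + struct.pack("<I", imm & 0xFFFFFFFF)

def assemble_intel(line: str) -> bytes:
    """Intel-style: destination first, bare immediate."""
    m = re.fullmatch(r"\s*mov\s+eax\s*,\s*(\d+)\s*", line)
    if not m:
        raise SyntaxError(line)
    return encode_mov_eax(int(m.group(1)))

def assemble_att(line: str) -> bytes:
    """AT&T-style: source first, '$' on immediates, '%' on registers."""
    m = re.fullmatch(r"\s*movl\s+\$(\d+)\s*,\s*%eax\s*", line)
    if not m:
        raise SyntaxError(line)
    return encode_mov_eax(int(m.group(1)))

print(assemble_intel("mov eax, 1234") == assemble_att("movl $1234, %eax"))
```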