When a Java file is compiled, it generates a .class file. This .class file contains the bytecode that the JVM interprets. When we open the .class file in a text editor, it is not human-readable. To view the bytecode, a disassembler like javap can be used.
My question is, why do we need to disassemble bytecode in order to view the bytecode itself?
What does the disassembler actually do, to convert the .class file into human readable format?
The Java virtual machine simulates a machine. This is why it is called a machine, despite being a virtual one that does not exist in hardware. Thus, when thinking about the difference between the javap output and the actual Java byte code, think about the difference between assembly and machine code:
Assembly code uses so-called mnemonics to make code human-readable. Such mnemonic names mean nothing to a machine, however, because a machine only knows how to read and manipulate binary data. Thus, we have to assemble the mnemonic (and its potential arguments) using an assembler, which translates each mnemonic into its binary equivalent. For example, for loading a value from a specific register we would write something like load 0xFF in assembly instead of using the actual binary opcode for this instruction, which might be something like 1001 1011 1111 1111. Similarly, with Java byte code, where the mnemonics are what javap produces, we need to present binary data to the (virtual) machine, which it is then able to process. Only when we want to read the byte code ourselves do we disassemble it into the assembly-like text that javap prints.
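To make the assembling step concrete, here is a minimal sketch of a toy assembler for the imaginary machine from the example above. The class name ToyAssembler, its opcode table, and the choice of 0x9B (the bit pattern 1001 1011) as the opcode for load are all made up for illustration; no real machine is implied.

```java
import java.util.Map;

class ToyAssembler {
    // Hypothetical opcode table for an imaginary machine.
    // 0x9B is the bit pattern 1001 1011 used in the example above.
    private static final Map<String, Integer> OPCODES = Map.of("load", 0x9B);

    // Translates one line such as "load 0xFF" into two bytes: opcode, operand.
    static byte[] assemble(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2 || !OPCODES.containsKey(parts[0])) {
            throw new IllegalArgumentException("cannot assemble: " + line);
        }
        int opcode = OPCODES.get(parts[0]);
        int operand = Integer.decode(parts[1]); // accepts hex such as 0xFF
        return new byte[] {(byte) opcode, (byte) operand};
    }

    // Renders bytes as groups of four bits, the way the machine "sees" them.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF))
                    .replace(' ', '0');
            sb.append(bits, 0, 4).append(' ').append(bits, 4, 8).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toBits(assemble("load 0xFF")));
    }
}
```

A disassembler simply runs this mapping in the other direction: it reads the bytes and looks up the names.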
Keep in mind: The only reason that assembly language and the javap output exist is that humans such as you and me do not enjoy reading binary code. We are trained to distinguish what we see by shapes, such as letters and names. In contrast, a machine interprets data sequentially by reading a stream of bits. As mentioned, these bits are hard for us to read, which is why we prefer to present them in hexadecimal format: instead of 1111 1111, we write 0xFF. But this is still rather difficult to read, as such a numeric value does not reveal its contextual meaning. 0xFF could still mean just about anything. This is why we prefer the mentioned mnemonics, where this meaning is implicit.
You might argue that a virtual machine is still only virtual and that it could therefore interpret mnemonics rather than binary Java byte code. However, such mnemonics would consume more space (strings are of course just represented as bytes by a machine) and would also take more time to interpret than the simulated machine language that runs on the JVM. You can therefore also think of the byte code as a weird encoding compared to standard encodings such as ASCII: its charset contains words instead of letters, and the only words in it are those used and understood by the Java virtual machine. Obviously, this Java byte code charset is more efficient than ASCII for describing the contents of a class file.
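As a rough illustration of the space argument, the sketch below compares four instructions written as ASCII mnemonics with the same four instructions as one-byte opcodes. The instruction sequence is what a static method such as int add(int a, int b) { return a + b; } typically compiles to; the class name EncodingSize and the one-mnemonic-per-line text layout are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class EncodingSize {
    // Four instructions written as text, one mnemonic per line (assumed layout).
    static final String AS_TEXT = "iload_0\niload_1\niadd\nireturn\n";

    // The same four instructions as their one-byte JVM opcodes.
    static final byte[] AS_BYTECODE = {0x1a, 0x1b, 0x60, (byte) 0xac};

    static int textSize() {
        return AS_TEXT.getBytes(StandardCharsets.US_ASCII).length;
    }

    static int bytecodeSize() {
        return AS_BYTECODE.length;
    }

    public static void main(String[] args) {
        System.out.println("text: " + textSize() + " bytes, bytecode: "
                + bytecodeSize() + " bytes");
    }
}
```

With this layout the text form takes 29 bytes against 4 bytes of byte code, and the text would additionally have to be tokenized and looked up before it could be executed.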
When it comes to saving data, available formats fall into two large categories:
- Text formats (such as simple text files, source code files, XML, etc), which have the advantages of being human readable and editable with simple tools, but they can only be parsed by complicated programs (the more complicated the language, the more complicated the program must be to actually understand it).
- Binary formats (such as most image formats, wave sounds, executables, bytecode files), which have the advantages of being smaller in size for the same amount of information and they don't need a complicated parser to be understood by the machine (often the data is stored in fixed-size chunks, which makes parsing them even easier).
A .class file is primarily intended to be fed to the JVM, so it should be in the smallest and easiest-to-read possible format for the machine. If the .class file were a text file (if the bytecode were saved in its human-readable form), parsing would be required every time the .class file is loaded. However, human readability is rarely needed, so paying for that parsing on every load would be a waste of the application's loading time.
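To show how little parsing a binary format needs, here is a minimal sketch that reads the fixed-size fields at the start of a class file: the 4-byte magic number 0xCAFEBABE, then the 2-byte minor version, major version, and constant pool count. The class name ClassHeaderReader is made up, and the sample bytes in main are hand-written for illustration rather than taken from a real compiled class.

```java
import java.nio.ByteBuffer;

class ClassHeaderReader {
    // Reads the first ten bytes of a class file and summarizes them.
    static String describe(byte[] classBytes) {
        // ByteBuffer is big-endian by default, like the class file format.
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        int magic = buf.getInt();
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = buf.getShort() & 0xFFFF; // unsigned 16-bit values
        int major = buf.getShort() & 0xFFFF;
        int constantPoolCount = buf.getShort() & 0xFFFF;
        return "major=" + major + " minor=" + minor
                + " constant_pool_count=" + constantPoolCount;
    }

    public static void main(String[] args) {
        // Hand-written sample header; major version 52 corresponds to Java 8.
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE,
            0x00, 0x00, // minor version
            0x00, 0x34, // major version
            0x00, 0x10  // constant pool count
        };
        System.out.println(describe(header));
    }
}
```

No tokenizing or grammar is involved: every field sits at a known offset with a known size, which is exactly what makes the format cheap for the JVM to load.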
A .class file is just the object code, which is machine-readable. If you want to see the code, you can use a decompiler such as the Jad decompiler.
A class file contains a bunch of commands/opcodes/data intended to be read by the JVM, which, when viewed by humans, is mostly just a huge bunch of numbers and embedded, seemingly senseless text.
The reason you need to disassemble it to read it is that the disassembler organizes the content in a way that makes sense to humans and replaces the numbers with their textual commands (e.g. textual versions of the opcodes, like aload instead of \19 or goto instead of \A7, both hexadecimal byte values), which make more sense to humans.
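The substitution described above can be sketched as a very small disassembler. This is only an illustration covering six opcodes, not how javap is implemented; the class name MiniDisassembler and the output layout are assumptions, while the opcode values themselves come from the JVM instruction set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MiniDisassembler {
    // A tiny, hand-picked subset of the JVM instruction set: opcode -> mnemonic.
    private static final Map<Integer, String> MNEMONICS = Map.of(
            0x19, "aload",
            0x1a, "iload_0",
            0x1b, "iload_1",
            0x60, "iadd",
            0xa7, "goto",
            0xac, "ireturn");

    // Number of operand bytes following each opcode; absent means none.
    private static final Map<Integer, Integer> OPERAND_BYTES = Map.of(
            0x19, 1,  // aload <local variable index>
            0xa7, 2); // goto <signed 16-bit branch offset>

    static List<String> disassemble(byte[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int start = pc;
            int opcode = code[pc++] & 0xFF;
            String mnemonic = MNEMONICS.get(opcode);
            if (mnemonic == null) {
                throw new IllegalArgumentException(
                        "opcode not in this sketch: 0x" + Integer.toHexString(opcode));
            }
            StringBuilder line = new StringBuilder(start + ": " + mnemonic);
            int operandBytes = OPERAND_BYTES.getOrDefault(opcode, 0);
            if (operandBytes == 1) {
                line.append(' ').append(code[pc] & 0xFF);
            } else if (operandBytes == 2) {
                // The offset is relative to the goto itself; print the target.
                int offset = (short) (((code[pc] & 0xFF) << 8) | (code[pc + 1] & 0xFF));
                line.append(' ').append(start + offset);
            }
            pc += operandBytes;
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] code = {0x1a, 0x1b, 0x60, (byte) 0xac};
        disassemble(code).forEach(System.out::println);
    }
}
```

A real disassembler also has to decode the rest of the class file (constant pool, method tables, attributes) before it even reaches the instruction bytes, which is the "organizing" part of the job.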
What the Java compiler does is translate your Java source into instructions that the virtual machine understands. The virtual machine itself is native software (HotSpot, for example, is written mostly in C++), while much of the class library is written in Java. The virtual machine turns the bytecode instructions into native calls for your operating system, which is why the JVM for Windows is different from the one for Unix-based systems.
As already stated in a comment, interpreting human-readable code is slower than interpreting instructions that are already in a compact binary form close to what a machine executes.