Why are compiled Java class files smaller than C c

Posted 2020-02-10 04:23

Question:

Why is the .o file produced by compiling a .c file that prints "Hello, World!" larger than the Java .class file for a program that also prints "Hello, World!"?

Answer 1:

Java compiles to bytecode, which is platform independent and "precompiled". Bytecode is meant to be consumed by an interpreter and is designed to be compact, so it is not the same thing as the machine code you see in a compiled C program. Take a look at the full Java compilation pipeline:

Java program  
-> Bytecode   
  -> High-level Intermediate Representation (HIR)   
    -> Middle-level Intermediate Representation (MIR)   
      -> Low-level Intermediate Representation (LIR)  
        -> Register allocation
          -> EMIT (Machine Code)

This is the chain that transforms a Java program into machine code. As you can see, bytecode is a long way from machine code. I couldn't find good material on the Internet that walks a real program through every stage; all I found is this presentation, where you can see how each step changes the code's representation. I hope this explains how and why a compiled C program and Java bytecode differ.

UPDATE: All the steps after "Bytecode" are performed by the JVM at run time, if and when it decides to compile that code. (That's another story: the JVM balances interpreting bytecode against compiling it to native, platform-dependent code.)

I finally found a good example, taken from Linear Scan Register Allocation for the Java HotSpot™ Client Compiler (good reading, by the way, for understanding what goes on inside the JVM). Suppose we have this Java method:

public static void fibonacci() {
  int lo = 0;
  int hi = 1;
  while (hi < 10000) {
    hi = hi + lo;
    lo = hi - lo;
    print(lo);
  }
}

Its bytecode is:

0:  iconst_0
1:  istore_0 // lo = 0
2:  iconst_1
3:  istore_1 // hi = 1
4:  iload_1
5:  sipush 10000
8:  if_icmpge 26 // while (hi < 10000)
11: iload_1
12: iload_0
13: iadd
14: istore_1 // hi = hi + lo
15: iload_1
16: iload_0
17: isub
18: istore_0 // lo = hi - lo
19: iload_0
20: invokestatic #12 // print(lo)
23: goto 4 // end of while-loop
26: return

Each instruction takes 1 byte for the opcode (the JVM has room for 256 opcodes but actually defines fewer than that) plus its operands. Altogether the method takes 27 bytes. Skipping all the intermediate stages, here is the ready-to-execute machine code:

00000000: mov dword ptr [esp-3000h], eax
00000007: push ebp
00000008: mov ebp, esp
0000000a: sub esp, 18h
0000000d: mov esi, 1h
00000012: mov edi, 0h
00000017: nop
00000018: cmp esi, 2710h
0000001e: jge 00000049
00000024: add esi, edi
00000026: mov ebx, esi
00000028: sub ebx, edi
0000002a: mov dword ptr [esp], ebx
0000002d: mov dword ptr [ebp-8h], ebx
00000030: mov dword ptr [ebp-4h], esi
00000033: call 00a50d40
00000038: mov esi, dword ptr [ebp-4h]
0000003b: mov edi, dword ptr [ebp-8h]
0000003e: test dword ptr [370000h], eax
00000044: jmp 00000018
00000049: mov esp, ebp
0000004b: pop ebp
0000004c: test dword ptr [370000h], eax
00000052: ret

In total it takes 83 bytes: the last instruction starts at offset 52 hex (82 decimal) and is 1 byte long.
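To make the two totals easy to check, here is a small sketch that recomputes them from the listings above. The class and method names are made up for illustration; the operand widths are the standard ones for these JVM instructions, and the 1-byte length of the final ret is taken from the answer itself.

```java
class CodeSizeComparison {
    // Mnemonics of the fibonacci() bytecode listing, in order.
    static final String[] BYTECODE = {
        "iconst_0", "istore_0", "iconst_1", "istore_1", "iload_1", "sipush",
        "if_icmpge", "iload_1", "iload_0", "iadd", "istore_1", "iload_1",
        "iload_0", "isub", "istore_0", "iload_0", "invokestatic", "goto", "return"
    };

    // Start offsets of the x86 instructions, copied from the listing.
    static final int[] X86_OFFSETS = {
        0x00, 0x07, 0x08, 0x0a, 0x0d, 0x12, 0x17, 0x18, 0x1e, 0x24, 0x26, 0x28,
        0x2a, 0x2d, 0x30, 0x33, 0x38, 0x3b, 0x3e, 0x44, 0x49, 0x4b, 0x4c, 0x52
    };

    // Number of operand bytes that follow the 1-byte opcode.
    static int operandBytes(String mnemonic) {
        switch (mnemonic) {
            case "sipush":        // 16-bit immediate value
            case "if_icmpge":     // 16-bit branch offset
            case "goto":          // 16-bit branch offset
            case "invokestatic":  // 16-bit constant pool index
                return 2;
            default:              // everything else here is opcode-only
                return 0;
        }
    }

    static int bytecodeSize() {
        int total = 0;
        for (String mnemonic : BYTECODE) {
            total += 1 + operandBytes(mnemonic);
        }
        return total;
    }

    // Assumption: the final ret occupies 1 byte, as stated in the answer.
    static int machineCodeSize() {
        return X86_OFFSETS[X86_OFFSETS.length - 1] + 1;
    }

    public static void main(String[] args) {
        System.out.println("bytecode:     " + bytecodeSize() + " bytes in "
                + BYTECODE.length + " instructions");
        System.out.println("machine code: " + machineCodeSize() + " bytes in "
                + X86_OFFSETS.length + " instructions");
    }
}
```

Most bytecode instructions here are a bare opcode, while most of the x86 instructions carry register, displacement, or immediate fields, which is where the extra bytes come from.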

P.S. I'm not taking linking into account (others have mentioned it), nor the headers of the compiled C file and the class file (they probably differ too; I don't know how it works for C, but in a class file all strings are moved into a separate constant pool, and the code refers to them by their position in that pool).

UPDATE 2: It is probably worth mentioning that Java bytecode works with a stack (the istore/iload instructions), whereas machine code for x86 and most other platforms works with registers. As you can see, the machine code is "full" of register operands, and encoding them adds size compared with the simpler stack-based bytecode.
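To illustrate the stack-based model, here is a toy interpreter for exactly the instructions used in the fibonacci() listing. It is only a sketch, not how HotSpot works: the opcode values are the standard JVM ones, but the class name is made up, and invokestatic #12 is modelled by simply recording the value on top of the stack instead of resolving a real method.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TinyStackMachine {
    // The 27 bytes of the fibonacci() listing (branch operands are relative offsets).
    static final int[] CODE = {
        0x03,             //  0: iconst_0
        0x3b,             //  1: istore_0
        0x04,             //  2: iconst_1
        0x3c,             //  3: istore_1
        0x1b,             //  4: iload_1
        0x11, 0x27, 0x10, //  5: sipush 10000
        0xa2, 0x00, 0x12, //  8: if_icmpge +18 (target 26)
        0x1b,             // 11: iload_1
        0x1a,             // 12: iload_0
        0x60,             // 13: iadd
        0x3c,             // 14: istore_1
        0x1b,             // 15: iload_1
        0x1a,             // 16: iload_0
        0x64,             // 17: isub
        0x3b,             // 18: istore_0
        0x1a,             // 19: iload_0
        0xb8, 0x00, 0x0c, // 20: invokestatic #12
        0xa7, 0xff, 0xed, // 23: goto -19 (target 4)
        0xb1              // 26: return
    };

    // Reads the signed 16-bit operand that follows the opcode at pc.
    static short s16(int pc) {
        return (short) ((CODE[pc + 1] << 8) | CODE[pc + 2]);
    }

    // Runs the code and returns every value passed to the modelled print(lo) call.
    static List<Integer> run() {
        Deque<Integer> stack = new ArrayDeque<>();  // operand stack
        int[] locals = new int[2];                  // slot 0 = lo, slot 1 = hi
        List<Integer> printed = new ArrayList<>();
        int pc = 0;
        while (true) {
            switch (CODE[pc]) {
                case 0x03: stack.push(0); pc += 1; break;
                case 0x04: stack.push(1); pc += 1; break;
                case 0x1a: stack.push(locals[0]); pc += 1; break;
                case 0x1b: stack.push(locals[1]); pc += 1; break;
                case 0x3b: locals[0] = stack.pop(); pc += 1; break;
                case 0x3c: locals[1] = stack.pop(); pc += 1; break;
                case 0x11: stack.push((int) s16(pc)); pc += 3; break;
                case 0x60: { int b = stack.pop(), a = stack.pop(); stack.push(a + b); pc += 1; break; }
                case 0x64: { int b = stack.pop(), a = stack.pop(); stack.push(a - b); pc += 1; break; }
                case 0xa2: { int b = stack.pop(), a = stack.pop(); pc += (a >= b) ? s16(pc) : 3; break; }
                case 0xa7: pc += s16(pc); break;
                case 0xb8: printed.add(stack.pop()); pc += 3; break; // stand-in for print(lo)
                case 0xb1: return printed;
                default: throw new IllegalStateException("unsupported opcode at offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        // Prints the Fibonacci numbers 1, 1, 2, 3, ... up to 6765.
        System.out.println(run());
    }
}
```

Note that iadd and isub name no operands at all: they implicitly take the top two stack entries, which is why they fit in a single byte, whereas the equivalent x86 add and sub must encode which registers they use.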



Answer 2:

The main cause of the size difference in this case is the difference in file formats. For such a small program, the ELF (.o) file format introduces serious space overhead.

For example, my sample .o file for the "Hello, world" program takes 864 bytes. It consists of (explored with the readelf command):

  • 52 bytes of file header
  • 440 bytes of section headers (40 bytes x 11 sections)
  • 81 bytes of section names
  • 160 bytes of symbol table
  • 43 bytes of code
  • 14 bytes of data (Hello, world\n\0)
  • etc
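As a quick check of the arithmetic, this sketch adds up the items listed above and shows how much is left for the "etc" entry. The class and method names are made up; the numbers are the ones from the list.

```java
class ElfBreakdown {
    static final int FILE_SIZE = 864;
    static final int CODE_BYTES = 43;

    // File header, section headers, section names, symbol table, code, data.
    static final int[] LISTED = {52, 40 * 11, 81, 160, CODE_BYTES, 14};

    static int listedTotal() {
        int total = 0;
        for (int bytes : LISTED) {
            total += bytes;
        }
        return total;
    }

    // Bytes covered by the "etc" entry (everything not itemised above).
    static int unlisted() {
        return FILE_SIZE - listedTotal();
    }

    // Share of the file that is actual machine code, in percent.
    static double codePercent() {
        return 100.0 * CODE_BYTES / FILE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("listed:   " + listedTotal() + " bytes");
        System.out.println("unlisted: " + unlisted() + " bytes");
        System.out.printf("code:     %.1f%% of the file%n", codePercent());
    }
}
```

The itemised parts add up to 790 bytes, so roughly 5% of the file is code and the rest is bookkeeping.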

The .class file of a similar program takes only 415 bytes, even though it contains more symbol names and those names are long. It consists of (explored with Java Class Viewer):

  • 289 bytes of constant pool (includes constants, symbol names, etc)
  • 94 bytes of method table (code)
  • 8 bytes of attribute table (source file name reference)
  • 24 bytes of fixed-size headers
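One plausible reading of the "24 bytes of fixed-size headers" is the set of fixed-width fields in the ClassFile structure from the JVM specification, which happen to add up to exactly that. This is an assumption on my part about how the viewer groups things, and the class name is made up.

```java
class ClassFileLayout {
    // Widths in bytes of the fixed-size ClassFile fields (u4 = 4 bytes, u2 = 2 bytes).
    static final int[] FIXED_FIELD_WIDTHS = {
        4, // magic (0xCAFEBABE)
        2, // minor_version
        2, // major_version
        2, // constant_pool_count
        2, // access_flags
        2, // this_class
        2, // super_class
        2, // interfaces_count
        2, // fields_count
        2, // methods_count
        2  // attributes_count
    };

    static int fixedHeaderBytes() {
        int total = 0;
        for (int width : FIXED_FIELD_WIDTHS) {
            total += width;
        }
        return total;
    }

    // Constant pool + method table + attribute table + fixed-size fields.
    static int totalBytes() {
        return 289 + 94 + 8 + fixedHeaderBytes();
    }

    public static void main(String[] args) {
        System.out.println("fixed-size fields: " + fixedHeaderBytes() + " bytes");
        System.out.println("whole class file:  " + totalBytes() + " bytes");
    }
}
```

Compare this with the ELF file above, where the file header and section headers alone take 492 bytes, more than the entire class file.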

See also:

  • Executable and Linkable Format
  • Java class file
  • Java Class Viewer


Answer 3:

C programs, even though they're compiled to native machine code that runs on your processor (dispatched through the OS, of course), tend to need a lot of setup and teardown for the operating system, loading dynamically linked libraries such as the C library, etc.

Java, on the other hand, compiles to bytecode for a virtual platform (basically a simulated computer-within-a-computer), which is specifically designed alongside Java itself, so a lot of this overhead (if it would even be necessary, since both the code and the VM interface are well-defined) can be moved into the VM itself, leaving the program code lean.

It varies from compiler-to-compiler, though, and there are several options to reduce it or build code differently, which will have different effects.

All this said, it's not really that important.



Answer 4:

In short: Java programs are compiled to Java byte code, which requires a separate interpreter (Java Virtual Machine) to be executed.

There is no guarantee either way about which file is smaller, the .o file produced by the C compiler or the .class file produced by the Java compiler. It all depends on the implementation of the compiler.



Answer 5:

One of the key reasons for differences in the sizes of .o and .class files is that Java bytecodes are a bit higher-level than machine instructions. Not hugely higher-level, of course – it's still pretty low-level stuff – but that will make a difference because it effectively acts to compress the whole program. (Both C and Java code can have startup code in there.)

Another difference is that Java class files often represent relatively small pieces of functionality. While it is possible to have C object files that map to even smaller pieces, it's often more common to put more (related) functionality in a single file. The differences in scoping rules can also act to emphasize this (C doesn't really have anything that corresponds to module-level scope, but it does have file-level scope instead; Java's package scope works across multiple class files). You get a better metric if you compare the size of a whole program.

In terms of "linked" sizes, Java executable JAR files tend to be smaller (for a given level of functionality) because they're delivered compressed. It's relatively rare to deliver C programs in compressed form. (There are also differences in the size of the standard library, but they might as well be a wash because C programs can count on libraries other than libc being present, and Java programs have access to a huge standard library. Picking apart who has the advantage is awkward.)
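To give a feel for the compression effect, here is a rough sketch that runs DEFLATE (the algorithm JAR/ZIP entries normally use) over some repetitive bytes. The input is a made-up stand-in for the repeated symbol names you find in class files, not real class data, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

class JarCompressionSketch {
    // Hypothetical stand-in for class file content: the same names over and over.
    static byte[] sampleInput() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("java/lang/Object;java/io/PrintStream;");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Compresses the input with DEFLATE and returns the compressed bytes.
    static byte[] deflate(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleInput();
        byte[] packed = deflate(raw);
        System.out.println("raw:        " + raw.length + " bytes");
        System.out.println("compressed: " + packed.length + " bytes");
    }
}
```

Real class files are less repetitive than this input, so expect a much more modest ratio in practice.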

Then, there's also the question of debugging information. In particular, if you compile a C program with debugging on that does IO, you'll get lots of information about types in the standard library included, just because it's a bit too awkward to filter it out. The Java code will only have debugging information about the actual compiled code because it can count on relevant information being available in the object file. Does this change the actual size of the code? No. But it can have a big impact on the file sizes.

Overall, I'd guess that it's hard to compare the sizes of C and Java programs. Or rather, you can compare them and easily learn nothing much useful.



Answer 6:

Most (as much as 90% for simple functions) of an ELF-format .o file is junk. For a .o file containing a single empty function body, you can expect a size breakdown like:

  • 1% code
  • 9% symbol and relocation table (essential for linking)
  • 90% header overhead, useless version/vendor notes stored by the compiler and/or assembler, etc.

If you want to see the real size of compiled C code, use the size command.



Answer 7:

A class file is Java bytecode.

It is most likely smaller because the C/C++ libraries and operating system libraries have to be linked with the object code the C compiler produces to finally make an executable binary.

Simply put, it is like comparing Java byte code to object code produced by a C compiler before it is linked to create a binary. The difference is the fact that a JVM interprets the Java byte code to properly do what the program is meant to do whereas C requires information from the operating system since the operating system functions as the interpreter.

Also, in C, every symbol (function etc.) that you reference from an external library at least once in one of the object files is imported. If you use it in multiple object files, it is still imported just once. There are two ways this "importing" can happen. With static linking, the actual code for a function is copied into the executable. This increases file size but has the advantage that no external libraries (.dll/.so files) are needed. With dynamic linking this doesn't happen, but as a result your program requires additional libraries to run.

In Java, everything is "linked" dynamically, so to speak.
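A small sketch of what that "dynamic linking" looks like from the program's side: classes are referred to by name and located only at run time. The class and method names here are made up for illustration; Class.forName is the standard API.

```java
class LateBinding {
    // Looks a class up by name at run time and reports what was found.
    static String load(String className) {
        try {
            return Class.forName(className).getName();
        } catch (ClassNotFoundException e) {
            return "not found: " + className;
        }
    }

    public static void main(String[] args) {
        // A class from the standard library, found by the running JVM.
        System.out.println(load("java.util.ArrayList"));
        // A made-up name that does not exist anywhere.
        System.out.println(load("no.such.Clazz"));
    }
}
```

Nothing from java.util.ArrayList is copied into the compiled class file; only its name is stored, which keeps the file small.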



Answer 8:

Java is compiled into a machine-independent language. This means that after it is compiled it is then translated at run time by the Java Virtual Machine (JVM). C is compiled to machine instructions, so its output contains all of the binary code the program needs to run on the target machine.

Because Java is compiled to a machine-independent language, the specific details for a particular machine are handled by the JVM (i.e. C has machine-specific overhead).

That is how I think about it anyway :-)



Answer 9:

A few potential reasons:

  • The Java class file does not include initialization code at all. It just has your one class and one function in it - very small indeed. In comparison, the C program has some degree of statically-linked initialization code, and possibly DLL thunks.
  • The C program may also have sections aligned to page boundaries. This would add a minimum of 4 KB to the program size just like that, in order to ensure the code segment starts on a page boundary.
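The page-alignment point can be sketched as simple rounding arithmetic. This assumes a 4096-byte page and borrows the 43 bytes of code from the ELF example in Answer 2 as the input; the class and method names are made up, and real linkers have more rules than this.

```java
class PageAlignment {
    static final long PAGE_SIZE = 4096; // assumed page size

    // Rounds size up to the next multiple of alignment.
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    // Padding bytes needed to reach the next page boundary.
    static long padding(long size) {
        return alignUp(size, PAGE_SIZE) - size;
    }

    public static void main(String[] args) {
        long codeBytes = 43;
        System.out.println("aligned size: " + alignUp(codeBytes, PAGE_SIZE));
        System.out.println("padding:      " + padding(codeBytes));
    }
}
```

With these numbers, 43 bytes of code end up occupying a full 4096-byte page, of which 4053 bytes are padding.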