I'm dealing with some code that gets 70-80% slower when compiled as PIC (position-independent code), and I'm looking for ways to alleviate the problem. A big part of it is that gcc insists on inserting the following in every single function:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_,%ebx
even if that ends up being 20% of the content of the function. Now, ebx is a call-preserved register, and every function in the relevant translation unit (source file) is loading it with the address of the GOT, and it's easily detectable that the static functions cannot be called from outside the translation unit (their addresses are never taken). So why can't gcc just load ebx once at the beginning of the big external-linkage functions, and generate the static-linkage functions so that they assume ebx has already been loaded with the address of the GOT? Is there any optimization flag I can use to force gcc to make this obvious and massive optimization, short of turning the inline limits up sky-high so everything gets inlined into the external functions?
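
For reference, here is a minimal sketch of the pattern I mean. It is not my real code: the file name pic_demo.c and the names counter, bump and bump_many are made up for illustration, and the GCC-specific noinline attribute is only there to stop the small helper from being inlined so that its own prologue stays visible.

```c
/* pic_demo.c: hypothetical names, for illustration only. */

/* File-local data; in 32-bit x86 PIC it is addressed relative to the GOT. */
static int counter;

/* Static helper whose address is never taken, so it can only be called
 * from within this translation unit. noinline (a GCC extension) keeps it
 * as a separate function so its prologue can be inspected. */
static __attribute__((noinline)) int bump(int by)
{
    counter += by;
    return counter;
}

/* External-linkage entry point that calls the static helper in a loop. */
int bump_many(int times)
{
    int last = 0;
    for (int i = 0; i < times; i++)
        last = bump(2);
    return last;
}
```

Compiling this with something like gcc -m32 -O2 -fPIC -S pic_demo.c and reading the generated assembly should show whether both bump and bump_many set up the GOT pointer on their own. The exact output depends on the GCC version; newer releases may name the thunk differently or pick a different register.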