Eliminating redundant loads of GOT register?

Posted 2019-08-03 06:36

Question:

I'm dealing with some code that's getting 70-80% slower when compiled as PIC (position independent code), and looking for ways to alleviate the problem. A big part of the problem is that gcc insists on inserting the following in every single function:

call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_,%ebx

even if that ends up being 20% of the content of the function. Now, ebx is a call-preserved register, and every function in the relevant translation unit (source file) is loading it with the address of the GOT, and it's easily detectable that the static functions cannot be called from outside the translation unit (their addresses are never taken). So why can't gcc just load ebx once at the beginning of the big external-linkage functions, and generate the static-linkage functions so that they assume ebx has already been loaded with the address of the GOT? Is there any optimization flag I can use to force gcc to make this obvious and massive optimization, short of turning the inline limits up sky-high so everything gets inlined into the external functions?

Answer 1:

There is probably no generic cure for this, but you could try to play around with inlining options. I'd guess that static functions in a compilation unit don't have too many callers, so the overhead in code replication wouldn't be too bad.

The easiest way to force such things with gcc would be to set __attribute__((always_inline)) on the functions. You could wrap it in a gcc-dependent macro to keep the code portable to other compilers.

If you don't want to modify the code (though static inline would be good anyhow), you could use the -finline-limit option to fine-tune that.
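A minimal sketch of the macro approach mentioned above, in case it helps. FORCE_INLINE, lookup, sum_table and table are hypothetical names made up for illustration; only __attribute__((always_inline)) is the actual gcc feature. The idea is that once the static helper is inlined into its external-linkage caller, only the caller needs to set up the GOT register.

```c
/* FORCE_INLINE is a hypothetical macro name, not something gcc provides.
 * With gcc-compatible compilers it expands to the always_inline attribute;
 * elsewhere it falls back to a plain static inline hint. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* File-scope data: under -fPIC on i386, accessing it needs the GOT address. */
static int table[4] = {1, 2, 3, 4};

/* Small static helper (hypothetical). When inlined, it no longer has a
 * prologue of its own, so it does not reload the GOT register. */
FORCE_INLINE int lookup(int i)
{
    return table[i & 3];
}

/* External-linkage entry point: the only function left to set up ebx. */
int sum_table(void)
{
    int total = 0;
    for (int i = 0; i < 4; i++)
        total += lookup(i);
    return total;
}
```

To check whether it makes a difference for your code, compile with something like gcc -O2 -fPIC -S and compare how many times the get_pc_thunk call shows up in the assembly with and without the attribute.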



Answer 2:

Not really a solution, but: if the functions in question do not reference file-scope variables, you could put them all together in a single translation unit and compile it without the -fPIC flag. Then link that object into the final shared object (SO) together with the other files as usual.
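To illustrate what kind of function would qualify, here is a rough sketch of a translation unit that touches only its arguments and locals. The names hash_bytes and mix are hypothetical, and the hash (32-bit FNV-1a) is just a stand-in for whatever hot code you have. Since nothing here refers to a global variable, a string literal, or an external function, there should be nothing that needs the GOT.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical static helper: works purely on its arguments, so it needs
 * no access to global data. The call from hash_bytes is within the same
 * translation unit and can be resolved as a plain PC-relative call. */
static uint32_t mix(uint32_t h, uint8_t byte)
{
    return (h ^ byte) * 16777619u;   /* 32-bit FNV prime */
}

/* Hypothetical exported function: only parameters, locals and constants. */
uint32_t hash_bytes(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;        /* 32-bit FNV offset basis */
    for (size_t i = 0; i < len; i++)
        h = mix(h, data[i]);
    return h;
}
```

You would compile this one file with plain gcc -O2 -c, compile the rest with -fPIC, and link everything with gcc -shared. Keep in mind this is only safe under the assumption stated in the answer: as soon as the non-PIC file needs an absolute address (a global, a string literal, a call into another object), the linker may have to emit text relocations or may refuse to link, depending on the platform.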