This is really a linker / object-file question, but tagging with assembly since compilers never do this. (Although maybe they could!)
Consider this function, where I want to handle one special case with a block of code that's in the same I-cache line as the function entry-point. To avoid jumping over it in the usual fast-path, is it safe (wrt. linking / shared libraries / other tools I haven't thought of) to put the code for it ahead of the function's global symbol?
I know this is silly / overkill, see below. Mostly I was just curious. Regardless of whether this technique is useful for making code that actually runs faster in practice, I think it's an interesting question.
.globl __nextafter_pjc // double __nextafter_pjc(double x, double y)
.p2align 6 // unrealistic 64B alignment, just for the sake of argument
// GNU as local labels have the form .L...
.Lequal_or_unordered:
jp .Lunordered
movaps %xmm1, %xmm0 # ISO C11 requires returning y, not x. (matters for -0.0 == +0.0)
ret
######### Function entry point / global symbol here #############
// .p2align something // tuning for Sandybridge, maybe best to just leave this unaligned, since it's only 6B from the alignment boundary
__nextafter_pjc:
ucomisd %xmm1, %xmm0
je .Lequal_or_unordered
xorps %xmm3, %xmm3
comisd %xmm3, %xmm0 // x==+/-0.0 can be a special case: the sign bit may change
je .Lx_zero
movq %xmm0, %rax
... // some mostly-branchless bit-ninjutsu that I have no idea how I'd get gcc to emit from C
ret
.Lx_zero:
...
ret
.Lunordered:
...
ret
(BTW, I'm messing around with asm for nextafter because I was curious about how glibc implemented it. It turns out the current implementation compiles to some really nasty code with a ton of branches. e.g. checking both inputs for NaN should be done with an FP compare, because that's super-fast, esp. in the non-NaN case.)
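To make the FP-compare point concrete, here is a minimal C sketch of just the equal-or-unordered special case. The helper name `nextafter_special_case` and its out-parameter interface are hypothetical and exist only for illustration; they are not from glibc or from the asm above. Returning `x + y` for a NaN input is an assumption about one reasonable way to propagate the NaN. I'd expect GCC or clang on x86-64 to turn the two tests into `ucomisd` plus flag branches, but that depends on compiler version and options, so check the generated asm rather than trusting this.

```c
#include <assert.h>
#include <math.h>   /* isunordered, isnan, signbit, NAN (C99) */

/* Hypothetical helper: handles only the "equal or unordered" case.
 * Returns 1 and stores the answer in *result if the case applied,
 * or 0 if the caller should continue with the general path. */
int nextafter_special_case(double x, double y, double *result)
{
    if (isunordered(x, y)) {   /* true if either input is NaN */
        *result = x + y;       /* assumption: propagate the NaN this way */
        return 1;
    }
    if (x == y) {              /* also true for -0.0 == +0.0 */
        *result = y;           /* ISO C: return y, not x */
        return 1;
    }
    return 0;
}

/* Usage example: the sign of zero comes from y, and NaN propagates. */
void nextafter_special_case_demo(void)
{
    double r;
    assert(nextafter_special_case(-0.0, 0.0, &r) && !signbit(r));
    assert(nextafter_special_case(0.0, -0.0, &r) && signbit(r));
    assert(nextafter_special_case(NAN, 1.0, &r) && isnan(r));
    assert(!nextafter_special_case(1.0, 2.0, &r));
}
```

The point of returning y rather than x only shows up for signed zeros, which is why the demo checks both zero orderings.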
In disassembly output, instructions before the label are listed as part of the previous function, e.g.
0000000000400ad0 <frame_dummy>:
...
400af0: 5d pop %rbp
400af1: e9 7a ff ff ff jmpq 400a70 <register_tm_clones>
400af6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400afd: 00 00 00
400b00: 7a 56 jp 400b58 <__nextafter_pjc+0x52>
400b02: 0f 28 c1 movaps %xmm1,%xmm0
400b05: c3 retq
0000000000400b06 <__nextafter_pjc>:
400b06: 66 0f 2e c1 ucomisd %xmm1,%xmm0
400b0a: 74 f4 je 400b00 <frame_dummy+0x30>
400b0c: 0f 57 db xorps %xmm3,%xmm3
400b0f: 66 0f 2f c3 comisd %xmm3,%xmm0
400b13: 74 4b je 400b60 <__nextafter_pjc+0x5a>
400b15: 66 48 0f 7e c0 movq %xmm0,%rax
...
Note that the 4th instruction in the main body, comisd, starts at 400b0f (and isn't fully contained in the first 16B-aligned block that contains the function entry-point). So it's maybe not really optimal for instruction-fetch and decode for the no-taken-branches fast-path to do it exactly this way. This is just an example, though.
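The boundary arithmetic behind that observation can be sketched in C. The helper `straddles_block` is a hypothetical name introduced here; the addresses and instruction lengths are taken from the disassembly above, and the 16-byte block size is the fetch-block assumption already made in this post, not a claim about any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an instruction of len bytes starting
 * at addr crosses a block-aligned boundary, 0 if it fits in one block.
 * block must be a power of two (e.g. 16 for a fetch block, 64 for a line). */
int straddles_block(uint64_t addr, unsigned len, uint64_t block)
{
    assert(len > 0 && block != 0 && (block & (block - 1)) == 0);
    uint64_t first = addr & ~(block - 1);              /* block holding the first byte */
    uint64_t last  = (addr + len - 1) & ~(block - 1);  /* block holding the last byte */
    return first != last;
}
```

For the listing above, `straddles_block(0x400b0f, 4, 16)` is 1 because comisd occupies 400b0f through 400b12, while `straddles_block(0x400b0c, 3, 16)` is 0 for the xorps just before it. With a 64-byte block, the comisd does not straddle anything, so it still sits in the same cache line as the entry point.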
So this appears to work, even at the beginning of a file. It does confuse objdump, and isn't ideal in gdb (but it's not a big problem). ELF symbols only carry a size if the asm source sets one with a .size directive, which I didn't use, so nm --print-size has nothing to show anyway. (That is probably also why nm --size-sort --print-size didn't include my function.)
I don't know much about Windows object files. Does anything worse happen there?
I'm slightly worried about correctness here: does anything ever try to copy single functions out of object files by taking bytes from their symbol address to the following symbol address? Normal library archives (ar for static libraries) and linkers copy whole object files around, right? Otherwise they couldn't be sure they were copying all necessary static data.
This function is probably called infrequently, and we want to minimize the cache pollution (I$, uop-cache, branch-predictors). And if anything, optimize for the un-cached case with cold branch predictors.
This is probably silly because the un-cached case can only happen infrequently. However, if many functions are all optimized this way, the total cache footprint will decrease and maybe they will all fit in cache.
Note that recent Intel CPUs don't do static branch prediction at all, so there's no reason to favor forward branches for usually-not-taken branches.
Instead of defaulting to taken for backward branches / not-taken for forward for "unknown" branches that aren't in the BHT, my understanding of Agner Fog's microarch doc (the branch prediction chapter) is that they don't check whether a branch is "new" or not. They just use whatever entry is already in the BHT, without clearing it. This may not be exactly true, though, for Nehalem.