Does a function with instructions before the entry-point label cause problems for linkers or other tools?

This is really a linker / object-file question, but tagging with assembly since compilers never do this. (Although maybe they could!)

Consider this function, where I want to handle one special case with a block of code that's in the same I-cache line as the function entry-point. To avoid jumping over it in the usual fast-path, is it safe (wrt. linking / shared libraries / other tools I haven't thought of) to put the code for it ahead of the function's global symbol?

I know this is silly / overkill, see below. Mostly I was just curious. Regardless of whether this technique is useful for making code that actually runs faster in practice, I think it's an interesting question.

.globl __nextafter_pjc      // double __nextafter_pjc(double x, double y)
.p2align 6  // unrealistic 64B alignment, just for the sake of argument

// GNU as local labels have the form  .L...
.Lequal_or_unordered:
    jp  .Lunordered
    movaps  %xmm1, %xmm0    # ISO C11 requires returning y, not x.  (matters for  -0.0 == +0.0)
    ret

######### Function entry point / global symbol here #############    
// .p2align something // tuning for Sandybridge, maybe best to just leave this unaligned, since it's only 6B from the alignment boundary
__nextafter_pjc:
    ucomisd %xmm1, %xmm0
    je  .Lequal_or_unordered

    xorps   %xmm3, %xmm3
    comisd  %xmm3, %xmm0    // x==+/-0.0 can be a special case: the sign bit may change
    je  .Lx_zero

    movq    %xmm0, %rax
    ...  // some mostly-branchless bit-ninjutsu that I have no idea how I'd get gcc to emit from C

    ret

.Lx_zero:
  ...
  ret
.Lunordered:
  ...
  ret

(BTW, I'm messing around with asm for nextafter because I was curious about how glibc implemented it. It turns out the current implementation compiles to some really nasty code with a ton of branches. e.g. checking both inputs for NaN should be done with an FP compare, because that's super-fast esp. in the non-NaN case.)

In disassembly output, instructions before the label are grouped after the previous function's instructions. e.g.

0000000000400ad0 <frame_dummy>:
                ...
  400af0:       5d                      pop    %rbp
  400af1:       e9 7a ff ff ff          jmpq   400a70 <register_tm_clones>
  400af6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400afd:       00 00 00 
  400b00:       7a 56                   jp     400b58 <__nextafter_pjc+0x52>
  400b02:       0f 28 c1                movaps %xmm1,%xmm0
  400b05:       c3                      retq   

0000000000400b06 <__nextafter_pjc>:
  400b06:       66 0f 2e c1             ucomisd %xmm1,%xmm0
  400b0a:       74 f4                   je     400b00 <frame_dummy+0x30>
  400b0c:       0f 57 db                xorps  %xmm3,%xmm3
  400b0f:       66 0f 2f c3             comisd %xmm3,%xmm0
  400b13:       74 4b                   je     400b60 <__nextafter_pjc+0x5a>
  400b15:       66 48 0f 7e c0          movq   %xmm0,%rax
                ...

Note that the 4th instruction in the main body, comisd, starts at 400b0f (and isn't fully contained in the first 16B-aligned block that contains the function entry-point). So it's maybe not really optimal for instruction-fetch and decode for the no-taken-branches fast-path to do it exactly this way. This is just an example, though.

So this appears to work, even at the beginning of a file. It does confuse objdump, and isn't ideal in gdb (but it's not a big problem). ELF symbol-table entries do have a size field, but it is only filled in by a .size directive, which this hand-written asm doesn't use, so nm --print-size has nothing to show. (That is probably also why nm --size-sort --print-size didn't include my function: for ELF it reads the recorded sizes rather than computing them from symbol addresses.)

I don't know much about Windows object files. Does anything worse happen there?

I'm slightly worried about correctness here: does anything ever try to copy single functions out of object files by taking bytes from their symbol address to the following symbol address? Normal library archives (ar for static libraries) and linkers copy whole object files around, right? Otherwise they couldn't be sure they were copying all necessary static data.

This function is probably called infrequently, and we want to minimize the cache pollution (I$, uop-cache, branch-predictors). And if anything, optimize for the un-cached case with cold branch predictors.

This is probably silly because the un-cached case can only happen infrequently. However, if many functions are all optimized this way, the total cache footprint will decrease and maybe they will all fit in cache.

Note that recent Intel CPUs don't do static branch prediction at all, so there's no reason to favor forward branches for usually-not-taken branches.

My understanding of Agner Fog's microarch doc (the branch prediction chapter) is that, rather than defaulting to taken for backward branches / not-taken for forward branches when a branch is "unknown" (not in the BHT), these CPUs don't check whether a branch is "new" at all. They just use whatever entry is already in the BHT, without clearing it. This may not be exactly true for Nehalem, though.

There's a simple way to make this look totally normal: put a non-global label in front of the code. This makes it look like (or actually be) a static helper function.

Non-global functions can call each other with any calling convention they want. C compilers can even make code like this with link-time / whole-program optimization, or even just optimization of static functions within a compilation unit. Jumps (rather than calls) to another function are already used for tail-call optimization.

The "helper function" code can jump into the main function at somewhere other than the entry point. I'm sure that's not a problem for linkers though. That would only break if a linker changed the distance between the helper and the main function (by inserting something between them) without adjusting relative jumps that cross the gap that it widened. I don't think any linker would insert anything that way in the first place, and doing so without fixing any branches is clearly a bug.

I'm not sure if there are any pitfalls in generating .size ELF metadata. I think I've read that it's important for functions that will be linked into shared libraries.

The following should work fine with any tool that deals with object files:

.globl __nextafter_pjc      // double __nextafter_pjc(double x, double y)
.p2align 6  // unrealistic 64B alignment, just for the sake of argument


nextafter_helper:  # not a local label, but not .globl either
.Lequal_or_unordered:
    jp  .Lunordered
    movaps  %xmm1, %xmm0    # ISO C11 requires returning y, not x.  (matters for  -0.0 == +0.0)
    ret

######### Function entry point / global symbol here #############    
// .p2align something?
__nextafter_pjc:
    ucomisd %xmm1, %xmm0
    je  .Lequal_or_unordered

    ...    
    ret

We don't need both a plain label and a "local" .L label, but using separate labels for separate purposes means less editing when re-arranging things. (e.g. if the .Lequal_or_unordered block later moves somewhere else, such as after the main body, it keeps its .L name and none of the jumps that target it have to change; only the plain label gets dropped.) A single name like nextafter_equal_or_unordered would also work.
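
On the .size question, here is a minimal sketch of how ELF symbol metadata could be attached to the layout above. It assumes GNU as targeting x86-64 ELF. The .type and .size directives are standard GNU as, but giving the helper its own size and measuring the global symbol from its entry label to the end of its code is only one plausible choice, not something I've verified against every tool. The elided bodies stay elided.

```asm
# Sketch only: assumes GNU as, x86-64 ELF. "..." bodies from the listing above are not filled in.
    .text
    .p2align 6

    .type   nextafter_helper, @function     # file-local symbol (no .globl)
nextafter_helper:
.Lequal_or_unordered:
    jp      .Lunordered
    movaps  %xmm1, %xmm0
    ret
    .size   nextafter_helper, .-nextafter_helper    # helper gets its own recorded size

    .globl  __nextafter_pjc
    .type   __nextafter_pjc, @function
__nextafter_pjc:
    ucomisd %xmm1, %xmm0
    je      .Lequal_or_unordered
    # ... rest of the fast path, as in the listing above ...
    ret
.Lunordered:
    # ... NaN handling, elided in the original ...
    ret
    .size   __nextafter_pjc, .-__nextafter_pjc      # measured from the entry label, so it excludes the helper
```

With this, tools that read recorded symbol sizes (e.g. nm --print-size) should report a size for each symbol, and the helper bytes should be attributed to nextafter_helper rather than to whatever function precedes it. Note that the size of __nextafter_pjc deliberately doesn't cover the helper block, so anything that treats [symbol, symbol+size) as the whole function would still miss it.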