Only evaluate an if-statement once

Posted 2019-04-15 07:20

Question:

I am currently building a kernel and have an if-statement that could (at worst) run a few million times, yet its result is clear after the first run. Knowing that the result of cmp is stored in a register, is there a way of remembering the result of the above statement so it doesn't have to be re-evaluated every time? acpi_version is GUARANTEED to never change.

SDT::generic_sdt* sdt_wrapper::get_table (size_t index) { //function is run many times with varying index
    if (index >= number_tables) { //no such index
        return NULL;
    }

    if (acpi_version == 0) [[unlikely]] {
        return (SDT::generic_sdt*) (rsdt_ptr -> PointerToOtherSDT [index]);
    } else [[likely]] {
        return (SDT::generic_sdt*) (xsdt_ptr -> PointerToOtherSDT [index]);
    }
}

As you can see, there is (at least to me) no apparent way of getting around having to do that comparison.

One attempt I made was the following ASM "hack":

static inline uint32_t store_cmp_result (uint32_t value1, uint32_t value2) {
    uint32_t zf;
    asm volatile ( "cmp %0, %1" :: "a" (value1), "a" (value2) );
    asm volatile ( "mov %ZF, [WORD]%0" :: "a" (zf) );   
    return zf;
} 

static inline void prestored_condition (uint32_t pres, void (*true_func)(), void (*false_func) ()) {
    asm volatile ( "mov %0, %1" :: "a" (pres)  "Nd=" () );
    asm volatile ( "je %0" :: "a" (&true_func) );
    asm volatile ( "jne %0" :: "a" (&false_func) );
}

Yet, it's just a hackish solution (that didn't really work, so I discarded it).

Now to the question:

How can I skip that if-statement after it has been evaluated once and just reuse the result from the last time?

Answer 1:

A compiler-generated cmp/jcc on acpi_version == 0 is about as cheap as you're going to get in the general case. It should predict very well because the branch always goes the same way every time, and the cost of a branch itself is quite low, even if it's taken every time. (Taken branches have slightly more cost than not-taken branches, because of their effect on the front-end fetch/decode stages and because it breaks up the used parts of the code across more I-cache lines.)

CPUs can't store compare results in any special way that's faster than testing an integer for non-zero. (i.e. a zero / non-zero integer already is how you would store a compare result!) [1]

In the specific case where the two sides of your if are very similar, there might be room for savings, but an easily-predicted compare+branch is very cheap. Programs are full of compare+branch, so modern CPUs have to be very good at running them. Something like 10 to 25% of the instruction count in a normal program is compare&branch, including unconditional jumps (don't quote me on the exact number, I've heard this but couldn't find a reliable source with a quick search). More branches takes up slightly more branch prediction resources, or on average worsens the prediction of other branches, but this is a small effect, too.

The compiler already will hoist the check out of loops after inlining. (By far the most important thing you can do here is make sure small accessor functions like sdt_wrapper::get_table can inline, either by putting them in the .h or using link-time optimization) Inline asm can only make it worse (http://gcc.gnu.org/wiki/DontUseInlineAsm) unless you do some super-hack like putting a label in the asm code so you can modify it or something.
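For example, here's a sketch of what "just make it inlinable" looks like (the caller loop and print_table() are placeholders, not your code): with the definition visible in the header, the compiler can evaluate acpi_version == 0 once and hoist the branch above the whole loop.

// In the header, so every caller can inline it (or rely on LTO instead):
inline SDT::generic_sdt* sdt_wrapper::get_table (size_t index) {
    if (index >= number_tables)
        return NULL;
    if (acpi_version == 0)
        return (SDT::generic_sdt*) (rsdt_ptr -> PointerToOtherSDT [index]);
    return (SDT::generic_sdt*) (xsdt_ptr -> PointerToOtherSDT [index]);
}

// Hypothetical caller: after inlining, the compiler can test acpi_version
// once, ahead of the loop, instead of once per iteration.
void dump_all_tables (sdt_wrapper &w) {
    for (size_t i = 0; i < number_tables; ++i) {
        if (SDT::generic_sdt *t = w.get_table(i))
            print_table(t);              // placeholder for whatever you do with the table
    }
}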

If you compare so often that you think it would be worth keeping acpi_version in a fixed global register dedicated only to that (a global register variable, which GNU C++ supports but which would probably not actually be good [2] even if you think it might be), then you could instead make the condition a template parameter for all your code (or a static constexpr or a CPP macro) and build 2 versions of your code: one for true and one for false. When you find out the value of the condition at bootup, unmap and reclaim the pages holding the version that will never run, and jump to the version that will run. (For a non-kernel, i.e. a normal program running in user-space under an OS, it's usually not a perf problem to just leave the clean pages mapped, especially if they stay untouched, including by runtime relocation.) The dispatch would look like if (acpi_version == 0) { rest_of_kernel<0>(...); } else { rest_of_kernel<1>(...); }, omitting the unmap / free part.
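A minimal sketch of that dispatch (assuming C++17 for if constexpr; rest_of_kernel and enter_kernel_main are placeholder names, and a real version would pass along whatever state the kernel needs):

extern int acpi_version;            // determined once at bootup, never changes

// Everything that used to branch on acpi_version takes it as a compile-time
// constant instead, so each instantiation contains no runtime check at all.
template <int AcpiVer>
void rest_of_kernel () {
    if constexpr (AcpiVer == 0) {
        // RSDT code path, baked in at compile time
    } else {
        // XSDT code path, baked in at compile time
    }
    // ... the rest of the kernel's main loop ...
}

void enter_kernel_main () {
    if (acpi_version == 0)
        rest_of_kernel<0>();        // could unmap/reclaim rest_of_kernel<1>'s pages here
    else
        rest_of_kernel<1>();
}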

If rsdt_ptr and xsdt_ptr are invariant, you could at least eliminate that extra level of indirection, if both PointerToOtherSDT arrays (PTOS for short) are in static storage.
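For instance, a sketch that caches the two array bases once during ACPI init (cache_sdt_arrays and the *_entries names are mine, not yours; I'm assuming the RSDT entries are uint32_t and the XSDT entries are uint64_t, and that the tables never move once mapped):

static uint32_t *rsdt_entries;      // = rsdt_ptr->PointerToOtherSDT
static uint64_t *xsdt_entries;      // = xsdt_ptr->PointerToOtherSDT

void cache_sdt_arrays () {          // call once, after the ACPI tables are mapped
    rsdt_entries = rsdt_ptr->PointerToOtherSDT;
    xsdt_entries = xsdt_ptr->PointerToOtherSDT;
}

SDT::generic_sdt* sdt_wrapper::get_table (size_t index) {
    if (index >= number_tables)
        return NULL;
    if (acpi_version == 0)          // still one well-predicted branch, but only one load
        return (SDT::generic_sdt*) (uintptr_t) rsdt_entries[index];
    return (SDT::generic_sdt*) xsdt_entries[index];
}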


Patching the code once at startup

You didn't tag an architecture, but your (broken in too many ways to mention) asm appears to be x86 so I'll talk about that. (All modern x86 CPUs have out-of-order execution and very good branch prediction, so there's probably not much to gain there, though.)

The Linux kernel does this, but it's complex: e.g. something like .pushsection list_of_addresses_to_patch; .quad .Lthis_instance%=; .popsection to build an array of pointers (as a special linker section) to the places that need to be patched, everywhere the asm statement was inlined. One place this trick is used is to patch lock prefixes to nop on uniprocessor machines running a kernel compiled with SMP support. This patching happens once, at bootup. (And it may even be able to patch the lock prefixes back in before hot-adding a CPU, because the mutex counters are still being maintained.)

In fact, Linux even uses asm goto and patches between a jmp or nop for invariant conditions like yours that are determined once at bootup, in bool _static_cpu_has(u16 bit) in arch/x86/include/asm/cpufeature.h. To start with, there's a jmp to a block that does a normal runtime check, testing a bit. But it uses .section .altinstructions,"a" / .previous to record where each jmp is, and the patch length / location. It appears cleverly constructed to work with 2-byte short vs. 5-byte long jmp rel8 / jmp rel32 jumps. So the kernel can patch all the locations where this code ends up, replacing the jmp with either a jmp to the right spot or a nop to fall through to the t_yes: return true label. gcc compiles this fairly nicely when you write if(_static_cpu_has(constant)) { ... }. After patching at some point after doing CPU feature detection, you end up with just a NOP and then fall into the loop body. (Or possibly multiple short-NOP instructions, I didn't check, but hopefully not!)

This is pretty damn cool, so I'm just going to copy the code because it's fun to see such a creative use of inline asm. I haven't looked for the code that does the patching, but obviously that + the linker script are the other key parts of this. I'm not trying to provide a workable version of it for this case, just show that the technique is possible, and where to find a GPLv2 implementation of it you could copy.

// from Linux 4.16 arch/x86/include/asm/cpufeature.h

/*
 * Static testing of CPU features.  Used the same as boot_cpu_has().
 * These will statically patch the target code for additional
 * performance.
 */
static __always_inline __pure bool _static_cpu_has(u16 bit)
{
    asm_volatile_goto("1: jmp 6f\n"
         "2:\n"
         ".skip -(((5f-4f) - (2b-1b)) > 0) * "
             "((5f-4f) - (2b-1b)),0x90\n"
         "3:\n"
         ".section .altinstructions,\"a\"\n"
         " .long 1b - .\n"      /* src offset */
         " .long 4f - .\n"      /* repl offset */
         " .word %P[always]\n"      /* always replace */
         " .byte 3b - 1b\n"     /* src len */
         " .byte 5f - 4f\n"     /* repl len */
         " .byte 3b - 2b\n"     /* pad len */
         ".previous\n"
         ".section .altinstr_replacement,\"ax\"\n"
         "4: jmp %l[t_no]\n"
         "5:\n"
         ".previous\n"
         ".section .altinstructions,\"a\"\n"
         " .long 1b - .\n"      /* src offset */
         " .long 0\n"           /* no replacement */
         " .word %P[feature]\n"     /* feature bit */
         " .byte 3b - 1b\n"     /* src len */
         " .byte 0\n"           /* repl len */
         " .byte 0\n"           /* pad len */
         ".previous\n"
         ".section .altinstr_aux,\"ax\"\n"
         "6:\n"
         " testb %[bitnum],%[cap_byte]\n"
         " jnz %l[t_yes]\n"
         " jmp %l[t_no]\n"
         ".previous\n"
         : : [feature]  "i" (bit),
             [always]   "i" (X86_FEATURE_ALWAYS),
             [bitnum]   "i" (1 << (bit & 7)),
             [cap_byte] "m" (((const char *)boot_cpu_data.x86_capability)[bit >> 3])
         : : t_yes, t_no);
t_yes:
    return true;
t_no:
    return false;
}
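For context, call sites just use an ordinary if; after the bootup patching, the check costs a NOP (or a single jmp) instead of a test + branch. (Illustration only: the copy_page_* functions are placeholders, and real kernel code goes through the static_cpu_has() wrapper macro.)

/* illustration only */
if (static_cpu_has(X86_FEATURE_AVX2))
        copy_page_avx2(dst, src);       /* placeholder fast path */
else
        copy_page_generic(dst, src);    /* placeholder fallback */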

Runtime binary patching for your specific case

In your specific case, the difference between your two versions is which global(?) pointer you dereference, and the type of PTOS. Storing a pointer to the base of the right array (as a void* or char*) is easy with pure C++, but indexing differently is tricky. In your case, PTOS is an array of uint32_t or uint64_t, as a flexible array member at the end of a struct. (Actually uint32_t PTOS[1] because ISO C++ doesn't support flexible array members, but if you're going to use GNU inline-asm syntax anyway, a proper flexible array member like uint32_t PTOS[] via the GNU extension is probably a good idea.)

On x86-64, changing the scale factor in an indexed addressing mode from 4 to 8 would do the trick, because a 64-bit load vs. a zero-extending 32-bit load uses the same opcode, just REX.W=0 (or no REX prefix) vs. REX.W=1 for the operand size. .byte 0x40; mov eax, [rdx + rdi*4] has the same length as mov rax, [rdx + rdi*8]. (The 0x40 byte in the first is a REX prefix with all its bits clear; the second version needs REX.W=1 for 64-bit operand size, while the first zero-extends into RAX by writing EAX. If the first version used a register like r10 that needs a REX prefix anyway, no dummy 0x40 would be needed.) Anyway, patching one to the other would be easy if you knew where all the relevant instructions were located.

If you had an infrastructure for recording places to patch, you'd use it to patch a mov instruction that gets a table pointer and index in registers, and returns a 64-bit value (from a 32 or 64-bit load). (And don't forget a dummy input to tell the compiler you actually read the memory pointed to by the table pointer, otherwise the compiler is allowed to do optimizations that could break your code like moving stores across the asm statement). But you'd have to be careful; inline asm can hurt optimization by disabling constant propagation (e.g. for index). At least if you omit volatile, the compiler is allowed to consider it a pure function of the inputs and CSE it.
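To make that concrete, here is a rough sketch of just the recording side (x86-64, GNU C++, AT&T syntax; the section name .ptos_patch_table and the helper are hypothetical, the boot-time patcher itself isn't shown, and a real version would also have to keep the 32-bit and 64-bit forms the same length, as discussed above):

#include <cstdint>
#include <cstddef>

// Hypothetical: do the 32-bit indexed load, and record this instruction's
// address in a custom section so boot code could later patch it into the
// 64-bit form once acpi_version is known.
static inline uint64_t load_ptos_entry (const void *table, std::size_t index) {
    uint64_t entry;
    asm("1:\n\t"
        "movl (%1,%2,4), %k0\n\t"                  // 32-bit load, zero-extends into %0
        ".pushsection .ptos_patch_table, \"a\"\n\t"
        ".quad 1b\n\t"                             // remember where this instruction is
        ".popsection"
        : "=r"(entry)
        : "r"(table), "r"(index),
          "m"(*(const unsigned char (*)[]) table)  // dummy input: we really do read *table
       );                                          // not volatile, so identical loads can be CSEd
    return entry;
}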


Hacking this up in pure C++

On x86, the scale factor in an addressing mode has to be coded into the instruction. Even with a runtime-invariant value, you (or the compiler) would still need a variable-count shift, or a multiply, to pull this off without self-modifying code (which compilers won't emit).

A variable-count shift costs 3 uops on Intel Sandybridge-family CPUs (http://agner.org/optimize/) because of legacy CISC semantics (count=0 leaves EFLAGS unmodified, so EFLAGS is an input to variable-count shifts), unless you let the compiler use BMI2 shlx (a flagless shift). index += foo ? 0 : index would conditionally double index (a difference of one in the shift count), but doing that branchlessly is not worth it on x86 for a condition that predicts this well.

Using a variable-count shift instead of a scaled-index addressing mode could be more costly than a well-predicted conditional branch.

uint64_t vs. uint32_t without runtime patching is another problem; one version needs a zero-extending 32-bit load, and the other needs a 64-bit load (unless the upper bytes happen to always be zero for you?). We could always do a 64-bit load and then mask to keep or discard the upper 32 bits, but that needs another constant, and it can suffer a performance penalty if the load crosses a cache-line (or worse, page) boundary: e.g. if the 32-bit value was the last one in a page, a normal 32-bit load would have just loaded it, but a 64-bit load + mask would also need data from the next page.

But with both of these things taken together, it's really not worth it. Just for fun, here's what you could do: source + asm output on the Godbolt compiler explorer

// I'm assuming rsdt_ptr and xsdt_ptr are invariants, for simplicity
static const char *selected_PTOS;
static uint64_t opsize_mask;   // 0x00000000FFFFFFFF or all-ones
static unsigned idx_scale;     // 2 or 3
// set the above when the value for acpi_version is found

void init_acpi_ver(int acpi_version) {
    // ... set selected_PTOS, opsize_mask and idx_scale based on acpi_version
}

// branchless but slower than branching on a very-predictable condition!
SDT::generic_sdt* sdt_wrapper::get_table (size_t index)
{
    const char *addr = selected_PTOS + (index << idx_scale);
    uint64_t entry = *reinterpret_cast<const uint64_t*>(addr);
    entry &= opsize_mask;      // zero-extend if needed
    return reinterpret_cast<SDT::generic_sdt*>(entry);
}

Asm output from Godbolt (with simpler types so it actually compiles)

get_table(unsigned long):
    mov     ecx, DWORD PTR idx_scale[rip]
    mov     rax, QWORD PTR selected_PTOS[rip]  # the table
    sal     rdi, cl
    mov     rax, QWORD PTR [rax+rdi]        # load the actual data we want
    and     rax, QWORD PTR opsize_mask[rip]
    ret

With inlining and CSE, the compiler could keep some of those mask and shift-count values in registers, but that's still extra work (and ties up registers).

And BTW, do not make the static vars locals inside a function; that would force the compiler to check every time to see if it was the first time the function had executed. The fast path for a static local (all that runs once the dust has settled from the init code) is pretty cheap, but about the same cost as what you're trying to avoid: a branch on an integer being non-zero!

int static_local_example() {
    static int x = ext();
    return x;
}

    # gcc7.3
    movzx   eax, BYTE PTR guard variable for static_local_example()::x[rip]
    test    al, al
    je      .L11
    # x86 loads are always acquire-loads, other ISAs would need a barrier after loading the guard
    mov     eax, DWORD PTR static_local_example()::x[rip]
    ret

A static function pointer (at class or file scope, not function-local) is worth considering, but replacing a conditional branch with an unconditional indirect call is unlikely to be a win. And then you have function-call overhead (clobbered registers, arg passing). Compilers would normally try to devirtualize it back into a conditional branch as an optimization!


Footnote 1: If your condition had been acpi_version == 4, then MIPS could have saved one instruction by storing a 0 / 1 result. Instead of comparing into flags, it has compare-into-register instructions, branch instructions that compare a register against zero (or against another register), and a register that always reads as zero. Even on x86, comparing for zero / non-zero saves a byte of code size if the value is already in a register (test eax,eax vs. cmp eax,4). It would save more if the value were the result of an ALU instruction (so ZF would already be set), but it's not.

But most other architectures compare into flags, and you can't load from memory directly into flags. So you'd only want to store a static bool result if acpi_version was expensive to compare, like an integer wider than a register such as __int128, or int64_t on a 32-bit machine.

Footnote 2: Don't use a global register variable for acpi_version; that would be silly. If it's used everywhere, then hopefully link-time optimization can do a good job at hoisting the comparison out of things.

Branch prediction + speculative execution means that the CPU doesn't actually have to wait for a load result when you branch on it, and if you read it all the time it will stay hot in L1d cache anyway. (Speculative execution means that control dependencies aren't part of the critical path, assuming correct branch prediction)


PS: if you made it this far and understood everything, then you should consider using binary patching like Linux does for a few frequently-checked conditions. If not, you probably shouldn't!



Answer 2:

Your particular function isn't a good candidate for any optimization which removes the if(), for the reasons Peter mentions in his answer, particularly that (1) the condition is extremely cheap to evaluate and (2) it is by definition off the critical path, since it is solely a control dependency and so doesn't form part of any data dependency chain.

That said, one general pattern for this kind of "one-off" runtime behavior selection, in cases where the check is actually somewhat expensive, is to use a function pointer in combination with a trampoline-like function that selects one of two (or more) implementations the first time it is called and overwrites the function pointer with the chosen implementation.

For illustration purposes, it would be something like this in your case. First, declare a typedef for your get_table function, and define a function pointer with the name get_table rather than a function [1]:

typedef generic_sdt* (*get_table_fn)(size_t index);

// here's the pointer that you'll actually "call"
get_table_fn get_table = get_table_trampoline;

Note that get_table is initialized to get_table_trampoline which is our selector function, called only once, to choose the runtime implementation. It looks like:

generic_sdt* get_table_trampoline(size_t index) {
    // choose the implementation
    get_table_fn impl = (acpi_version == 0) ? zero_impl : nonzero_impl;
    // update the function pointer to redirect future callers
    get_table = impl;
    // call the selected function 
    return impl(index);
}

It just selects, based on acpi_version, whether to use the zero_impl or nonzero_impl version of the function; these are just your implementation with the if-statement already resolved:

generic_sdt* zero_impl(size_t index) {
    if (index >= number_tables) { //no such index
        return NULL;
    }

    return (SDT::generic_sdt*) (rsdt_ptr -> PointerToOtherSDT [index]);
}

generic_sdt* nonzero_impl(size_t index) {
    if (index >= number_tables) { //no such index
        return NULL;
    }

    return (SDT::generic_sdt*) (xsdt_ptr -> PointerToOtherSDT [index]);
}

Now all subsequent callers jump directly to the underlying simplified implementation, without the acpi_version check.

If the original code was actually calling get_table in the underlying assembly (i.e., it wasn't inlined, probably because it wasn't declared in a header file), the transformation from a direct call to a call through a function pointer is likely to have only a small to zero performance impact when the indirect call is correctly predicted - and since the target is fixed after the first call, it will be well predicted after the first few calls unless your indirect BTB is under pressure.

If callers were able to inline the original get_table call, this approach is much less desirable, since it will inhibit inlining. Your original suggestion of self-modifying code wasn't going to work at all with inlining, however.

As mentioned above, for the specific case of removing a single well-predicted if, this isn't going to do much: I expect it to be about a wash (even if you just removed the if and hard-coded one case, I don't think you'd find it faster) - but the technique can be useful for more complex cases. Think of it as kind of a "self-modifying code lite": you only modify a function pointer, not the actual underlying code.

In a multi-threaded program, this will in theory invoke undefined behavior, although rarely [2] in practice. To bring it into compliance, you'll need to wrap the get_table pointer in std::atomic and use the appropriate methods to load and store it (memory_order_relaxed should be enough, effectively the racy single-check idiom).
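A minimal sketch of that variant, reusing the made-up names from the code above (get_table_fn, zero_impl, nonzero_impl, acpi_version):

#include <atomic>
#include <cstddef>

// Declarations of the pieces defined earlier in this answer.
struct generic_sdt;
typedef generic_sdt* (*get_table_fn)(size_t index);
extern int acpi_version;
generic_sdt* zero_impl(size_t index);
generic_sdt* nonzero_impl(size_t index);
generic_sdt* get_table_trampoline(size_t index);

// Relaxed ordering is enough: every value the pointer can ever hold is a
// valid function, and no other data is published through it.
std::atomic<get_table_fn> get_table{get_table_trampoline};

generic_sdt* get_table_trampoline(size_t index) {
    get_table_fn impl = (acpi_version == 0) ? zero_impl : nonzero_impl;
    get_table.store(impl, std::memory_order_relaxed);  // later callers skip the trampoline
    return impl(index);
}

// Call sites become:
//   generic_sdt* t = get_table.load(std::memory_order_relaxed)(index);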

Alternately, you can avoid the trampoline and the memory-ordering concerns entirely by initializing the function pointer before its first use, if you have a suitable place in your code to do that.
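For example (a sketch; init_acpi is a placeholder for whatever single-threaded setup code discovers the ACPI version):

// No trampoline and no atomics needed if the plain get_table pointer from
// above is set during single-threaded startup, before anything calls it.
void init_acpi (int version) {
    acpi_version = version;
    get_table = (version == 0) ? zero_impl : nonzero_impl;
}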


[1] I removed the namespaces here to keep things more succinct and to make it more likely my unverified code would actually approximately compile.

[2] Here I'm using rarely in a very restricted sense: I'm not saying this will compile to problematic multi-threaded code that only rarely shows a problem at runtime (that would be pretty bad). Rather, I'm saying that it will compile to correct multi-threaded code on most compilers and platforms, so it will be rare that the underlying generated code is unsafe at all (and indeed patterns like this were heavily used and worked in practice before C++11 atomics existed).