How to use the APIC to create IPIs to wake the APs

In a post-boot environment (no OS), how would one use the BSP (first core/processor) to create IPIs for the APs (all other cores/processors)? Essentially, how does one wake and set the instruction pointer for the other cores when starting from one?

WARNING: I've assumed 80x86 here. If it's not 80x86 then I don't know :-)

First you need to find out how many other CPUs exist and what their APIC IDs are, and determine the physical address of the local APICs. To do this you parse ACPI tables (see MADT/APIC in the ACPI specification). If you can't find valid ACPI tables (e.g. computer is too old) there's an older "MultiProcessor Specification" that defines its own tables with the same information in it. Note that the "MultiProcessor Specification" is deprecated now (and there are some computers with dummy MultiProcessor tables) which is why you need to check the ACPI tables first.
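As a rough sketch of that first step, walking the MADT to collect APIC IDs might look something like the code below. The function names (rd32, madt_collect_apic_ids) are made up for illustration. It assumes you have already located the table and mapped it into memory, and it leaves out checksum validation and the "local APIC address override" entry, which real code would want to handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value from a possibly unaligned pointer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Walk a MADT image and collect the APIC IDs of CPUs whose "enabled"
 * flag (bit 0) is set. Returns the number of IDs stored, or -1 if the
 * buffer does not look like a MADT. *lapic_addr receives the 32-bit
 * local APIC physical address stored after the 36-byte table header.
 */
static int madt_collect_apic_ids(const uint8_t *madt, size_t size,
                                 uint32_t *lapic_addr,
                                 uint32_t *ids, int max_ids)
{
    if (size < 44 || memcmp(madt, "APIC", 4) != 0)
        return -1;
    uint32_t length = rd32(madt + 4);
    if (length < 44 || length > size)
        return -1;
    *lapic_addr = rd32(madt + 36);

    int count = 0;
    size_t off = 44;                         /* first variable-length entry */
    while (off + 2 <= length) {
        uint8_t type = madt[off];
        uint8_t len = madt[off + 1];
        if (len < 2 || off + len > length)
            break;                           /* malformed entry: stop */
        if (type == 0 && len >= 8) {         /* Processor Local APIC */
            if ((rd32(madt + off + 4) & 1) && count < max_ids)
                ids[count++] = madt[off + 3];
        } else if (type == 9 && len >= 16) { /* Processor Local x2APIC */
            if ((rd32(madt + off + 8) & 1) && count < max_ids)
                ids[count++] = rd32(madt + off + 4);
        }
        off += len;
    }
    return count;
}
```

Entries that are not marked as enabled are skipped on purpose, for the same reason that broadcasting IPIs is a bad idea: firmware uses that flag to tell you which CPUs you're allowed to start.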

The next step is to determine what type of local APIC you have. There are 3 cases - old external "82489DX" local APICs (not built into the CPU itself), xAPIC and x2APIC.

Start by checking CPUID to determine if the local APIC is x2APIC. If it is you have 2 choices - you can use x2APIC, or you can use "xAPIC compatibility mode". For "xAPIC compatibility mode" you can only use 8-bit APIC IDs and won't be able to support computers with lots of CPUs (e.g. 255 or more CPUs). I'd recommend using x2APIC (even if you don't care about computers with lots of CPUs) as it's faster. If you do use x2APIC mode then you'll need to switch the local APIC into this mode.
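For illustration, the two bit manipulations involved could be sketched like this. The helper names are hypothetical, and the sketch works on plain values: it assumes you've already executed CPUID with EAX=1 and read the IA32_APIC_BASE MSR (0x1B) yourself, and that you write the returned value back with WRMSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_ECX_X2APIC  (1u << 21)    /* CPUID.01H:ECX bit 21 */
#define IA32_APIC_BASE_MSR  0x1Bu         /* MSR holding the bits below */
#define APIC_BASE_EXTD      (1ull << 10)  /* x2APIC mode enable */
#define APIC_BASE_EN        (1ull << 11)  /* local APIC global enable */

/* True if the ECX value returned by CPUID leaf 1 reports x2APIC support. */
static bool cpu_has_x2apic(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_X2APIC) != 0;
}

/* Given the current IA32_APIC_BASE value, return the value to write
 * back in order to switch the local APIC into x2APIC mode. */
static uint64_t apic_base_with_x2apic(uint64_t apic_base)
{
    return apic_base | APIC_BASE_EN | APIC_BASE_EXTD;
}
```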

Otherwise, if it's not x2APIC, read the local APIC's version register. If the local APIC's version is 0x10 or higher then it's xAPIC, and if it's 0x0F or lower then it's an external "82489DX" local APIC.
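A minimal sketch of the version check, assuming you've already read the 32-bit version register (offset 0x30 from the local APIC's base address) into a variable. The enum and function names are made up for illustration.

```c
#include <stdint.h>

#define LAPIC_REG_VERSION 0x30u   /* offset of the version register */

enum lapic_kind { LAPIC_82489DX, LAPIC_XAPIC };

/* Only the low 8 bits of the register hold the version number. */
static enum lapic_kind classify_lapic(uint32_t version_reg)
{
    uint8_t version = (uint8_t)(version_reg & 0xFFu);
    return version >= 0x10 ? LAPIC_XAPIC : LAPIC_82489DX;
}
```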

The old external "82489DX" local APICs were used in 80486 and older computers, and these are extremely rare (they were very rare 20 years ago, then most died and/or got replaced and thrown away since). Because a different sequence is used to start other CPUs, and because computers that have these local APICs are extremely rare (e.g. you will probably never be able to test your code), it makes a lot of sense to not bother supporting these computers. If you support these old computers at all, I'd recommend treating them as "single-CPU only" and simply not starting any other CPU/s if the local APIC is "82489DX". For this reason I won't describe the method used to start them here (it is described in Intel's "MultiProcessor Specification" if you're curious).

For xAPIC and x2APIC, the sequence for starting another CPU is essentially the same (just different ways of accessing the local APIC - MSRs or memory mapped). I'd recommend using (e.g.) function pointers to hide these differences, so that later code can call a "send IPI" function via the function pointer without caring if the local APIC is x2APIC or xAPIC.
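Something like the following sketch shows the idea. All of the names are hypothetical, and the hardware access is deliberately faked (an array instead of the memory mapped registers, a stub instead of WRMSR) so that it can be compiled and run as a normal program. In real code mmio_write would write to the mapped local APIC page and wrmsr_stub would execute WRMSR.

```c
#include <stdint.h>

/* Stand-ins for real hardware access (assumptions for illustration). */
static uint32_t fake_mmio[0x400 / 4];   /* pretend local APIC registers */
static uint32_t last_msr;               /* index of last MSR "written" */
static uint64_t last_msr_value;         /* value of last MSR "written" */

static void mmio_write(uint32_t reg, uint32_t value)
{
    fake_mmio[reg / 4] = value;
}

static void wrmsr_stub(uint32_t msr, uint64_t value)
{
    last_msr = msr;
    last_msr_value = value;
}

#define XAPIC_ICR_LOW   0x300u
#define XAPIC_ICR_HIGH  0x310u
#define X2APIC_ICR_MSR  0x830u

typedef void (*send_ipi_fn)(uint32_t apic_id, uint32_t icr_low);

/* xAPIC: the destination goes in bits 24-31 of the high half, and
 * writing the low half is what actually sends the IPI. */
static void xapic_send_ipi(uint32_t apic_id, uint32_t icr_low)
{
    mmio_write(XAPIC_ICR_HIGH, apic_id << 24);
    mmio_write(XAPIC_ICR_LOW, icr_low);
}

/* x2APIC: a single 64-bit MSR write, destination in the upper 32 bits. */
static void x2apic_send_ipi(uint32_t apic_id, uint32_t icr_low)
{
    wrmsr_stub(X2APIC_ICR_MSR, ((uint64_t)apic_id << 32) | icr_low);
}

/* Chosen once during APIC detection; later code just calls the pointer. */
static send_ipi_fn select_send_ipi(int use_x2apic)
{
    return use_x2apic ? x2apic_send_ipi : xapic_send_ipi;
}
```

Later code would do something like send_ipi = select_send_ipi(have_x2apic); once, and then call send_ipi(apic_id, command) everywhere else. For xAPIC you'd normally also poll the delivery status bit (bit 12 of the low half) before sending the next IPI, which is left out here to keep the sketch short.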

To actually start another CPU you need to send a sequence of IPIs (Inter Processor Interrupts) to it. Intel's method goes like this:

Send an INIT IPI to the CPU you're starting
Wait for 10 ms
Send a STARTUP IPI to the CPU you're starting
Wait for 200 us
Send another STARTUP IPI to the CPU you're starting
Wait for 200 us
Wait for started CPU to set a flag (so you know it started)
    If flag was set by other CPU, other CPU was started successfully
    Else if time-out, other CPU failed to start
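
For reference, the command words for the two kinds of IPI used in this sequence could be built as shown below. This is only a sketch with made-up helper names. It assumes "no shorthand" physical destination mode (the target's APIC ID is supplied separately), and the result is the low 32 bits of the ICR.

```c
#include <stdint.h>

#define ICR_DELIVERY_INIT     (5u << 8)   /* delivery mode 101b */
#define ICR_DELIVERY_STARTUP  (6u << 8)   /* delivery mode 110b */
#define ICR_LEVEL_ASSERT      (1u << 14)

/* Low 32 bits of the ICR for an INIT IPI (0x00004500). */
static uint32_t icr_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* Low 32 bits of the ICR for a STARTUP IPI (0x000046xx, where xx is
 * the vector that selects the address the CPU starts executing at). */
static uint32_t icr_startup(uint8_t vector)
{
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | vector;
}
```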

There are 2 problems with Intel's method. The first is that the other CPU will often be started by the first STARTUP IPI, and in some cases this can lead to problems (e.g. if the other CPU's startup code does something like total_CPUs++; then each CPU might execute it twice). To avoid this problem you can add extra synchronisation (e.g. other CPU waits for an "I know you started" flag to be set by the first CPU before it continues). The second problem with Intel's method is measuring those delays. Typically an OS starts the other CPUs, then figures out what features the CPUs support and what hardware is present afterwards, and doesn't have precise timer/s set up to measure those 200 us delays accurately.

To avoid those problems, I use an alternative method that goes like this:

Send an INIT IPI to the CPU you're starting
Wait for 10 ms
Send a STARTUP IPI to the CPU you're starting
Wait for started CPU to set a flag (so you know it started) with a short timeout (e.g. 1 ms)
    If flag was set by other CPU, other CPU was started successfully
    Else if time-out
        Send another STARTUP IPI to the CPU you're starting
        Wait for started CPU to set a flag with a long timeout (e.g. 200 ms)
            If flag was set by other CPU, other CPU was started successfully
            Else if time-out, other CPU failed to start
If CPU started successfully
    Set flag to tell other CPU it can continue
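
In C, the alternative sequence above might be sketched roughly like this. Everything here is an assumption made for illustration: the ap_hooks function pointers stand in for your real "send IPI" and delay routines, ap_sync stands in for the flags shared with the startup code, and the "fake" backend at the bottom exists only so the logic can be run as a normal program without real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

struct ap_hooks {                 /* hypothetical hardware hooks */
    void (*send_ipi)(uint32_t apic_id, uint32_t icr_low);
    void (*delay_us)(uint32_t us);
};

struct ap_sync {                  /* flags shared with the started CPU */
    volatile int started;         /* set by the CPU being started */
    volatile int may_continue;    /* set by the CPU doing the starting */
};

/* Poll a flag in 10 us steps until it is set or the timeout expires. */
static bool wait_for_flag(const struct ap_hooks *h, volatile int *flag,
                          uint32_t timeout_us)
{
    for (uint32_t waited = 0; waited < timeout_us; waited += 10) {
        if (*flag)
            return true;
        h->delay_us(10);
    }
    return *flag != 0;
}

static bool start_ap(const struct ap_hooks *h, struct ap_sync *sync,
                     uint32_t apic_id, uint8_t vector)
{
    sync->started = 0;
    sync->may_continue = 0;

    h->send_ipi(apic_id, 0x4500u);                /* INIT IPI */
    h->delay_us(10000);                           /* wait 10 ms */
    h->send_ipi(apic_id, 0x4600u | vector);       /* first STARTUP IPI */
    if (!wait_for_flag(h, &sync->started, 1000)) {        /* 1 ms */
        h->send_ipi(apic_id, 0x4600u | vector);   /* second STARTUP IPI */
        if (!wait_for_flag(h, &sync->started, 200000))    /* 200 ms */
            return false;                         /* CPU failed to start */
    }
    sync->may_continue = 1;       /* tell the other CPU it can continue */
    return true;
}

/* ---- Fake backend: pretends the CPU starts after N STARTUP IPIs ---- */
static struct ap_sync fake_sync;
static int fake_sipi_count, fake_sipis_needed;

static void fake_send_ipi(uint32_t apic_id, uint32_t icr_low)
{
    (void)apic_id;
    if ((icr_low & 0x700u) == 0x600u && ++fake_sipi_count >= fake_sipis_needed)
        fake_sync.started = 1;    /* as if the startup code had run */
}

static void fake_delay_us(uint32_t us) { (void)us; }
```

With the fake backend you'd call it as struct ap_hooks h = { fake_send_ipi, fake_delay_us }; followed by start_ap(&h, &fake_sync, apic_id, vector);. On real hardware the memory that ap_sync lives in has to be visible to the CPU being started, and plain volatile is only good enough for a sketch.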

Also note that you need to start CPUs individually. I've seen people start all CPUs at the same time using the "broadcast IPI to all but self" feature - this is wrong and broken and dodgy (don't do it unless you're writing firmware). The problem with this is that some CPUs may be faulty (e.g. failed their BIST/built-in self test) and some CPUs may be disabled (e.g. hyper-threading when hyper-threading is disabled in firmware); and the "broadcast IPI to all but self" method can start CPUs that should never have been started.

Finally, for computers with a large number of CPUs it can take a relatively long time to start them all if you're starting them one at a time. For example, if it takes 11 ms to start each CPU and there are 128 CPUs, then it'd take 1.4 seconds. If you want to boot fast there are ways to avoid this. For example, the first CPU can start the second CPU, then the 1st and 2nd CPU can start the 3rd and 4th CPU, then those four CPUs can start the next four CPUs, etc. In this way you can start 128 CPUs in 77 ms instead of 1.4 seconds.
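The arithmetic behind those numbers can be checked with a small sketch like this (the function names are made up, and it assumes every CPU takes the same time to start and that the first CPU is already running):

```c
#include <stdint.h>

/* One at a time: every CPU except the first has to be started. */
static uint32_t sequential_start_ms(uint32_t total_cpus, uint32_t ms_per_cpu)
{
    return total_cpus == 0 ? 0 : (total_cpus - 1) * ms_per_cpu;
}

/* Number of rounds needed if every running CPU starts one more CPU
 * per round (1 -> 2 -> 4 -> 8 -> ...). */
static uint32_t doubling_rounds(uint32_t total_cpus)
{
    uint32_t rounds = 0;
    for (uint64_t running = 1; running < total_cpus; running *= 2)
        rounds++;
    return rounds;
}

static uint32_t parallel_start_ms(uint32_t total_cpus, uint32_t ms_per_cpu)
{
    return doubling_rounds(total_cpus) * ms_per_cpu;
}
```

For 128 CPUs at 11 ms each, sequential_start_ms gives 127 * 11 = 1397 ms (about 1.4 seconds), while parallel_start_ms gives 7 rounds * 11 ms = 77 ms.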

Note: I'd recommend just starting CPUs one at a time and making sure that works before you attempt any kind of "parallel startup" (it's something you can worry about after you know the rest works).

The address that the other CPU/s will begin executing is encoded in the "vector" field of the STARTUP IPI. The CPU will start executing code (in real mode) with CS = vector * 256 and IP = 0, which corresponds to physical address vector * 4096. The vector field is 8-bit, so the highest starting address you can use is 0x000FF000 (0xFF00:0x0000 in real mode). However, this is the legacy ROM area (in practice the starting address would have to be lower). Typically you'd copy a little piece of startup code into a suitable address, where the startup code handles synchronisation (e.g. setting an "I started" flag that another CPU can see and waiting to be told it's OK to continue) and then does things like enabling protected/long mode and setting up a stack before jumping to an entry point in the OS's normal code. This little piece of startup code is called the "AP CPU startup trampoline". This is also what makes the "parallel startup" a little complicated, as each CPU being started needs its own/separate synchronisation flags and stack, and because these things are normally implemented with variables in the trampoline (e.g. mov esp,[cs:stackTop]) it means you end up with multiple trampolines.
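As a small sketch of that encoding (the helper names are hypothetical), converting a trampoline's physical address into a STARTUP IPI vector and back into the real mode CS value could look like this:

```c
#include <stdint.h>

/* Returns the STARTUP IPI vector for a trampoline copied to phys_addr,
 * or -1 if the address can't be encoded (it must be 4 KiB aligned and
 * below 1 MiB). */
static int sipi_vector_for(uint32_t phys_addr)
{
    if (phys_addr >= 0x100000u || (phys_addr & 0xFFFu) != 0)
        return -1;
    return (int)(phys_addr >> 12);
}

/* The CS value the started CPU begins with (IP is 0). */
static uint16_t sipi_start_cs(uint8_t vector)
{
    return (uint16_t)(vector << 8);
}
```

For example, a trampoline copied to 0x00008000 would use vector 0x08, and the CPU would begin executing at 0x0800:0x0000.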