Retrieving return address of an exception on ARM C

I am trying to retrieve the return address of an IRQ handler in my code. My aim is to save the value of the PC just before the watchdog timer expires and before the reset for debug purposes, using WDT_IRQHandler(). I am also testing this approach with other IRQs to check if I grasped the idea. But it seems I haven't.

I have read the documentation available. I understood that when an exception happens, 8 registers are pushed to the stack: R0, R1, R2, R3, R12, LR, PC and XPSR.

I have also read that the stack is automatically double word aligned. So in my mind, retrieving the return address is as easy as:

retrieve the sp address with __builtin_frame_address(0);
add to it the offset of the stacked PC (0x18), and read the value, which supposedly is the value that will be restored to the PC when the handler returns.

Checking with the debugger attached, this seems not the case, the content at that memory address doesn't always point to a flash area, or even to a valid area, and in any case it is never the value that PC will assume after the POP instruction.

The code works fine, so I think it's a problem I have in understanding how it works.

If I check the disassembly, in some IRQs a constant is added to the sp before POPping (?)

00001924: 0x000009b0 ...TE_IRQHandler+280   add     sp, #36 ; 0x24
00001926: 0x0000f0bd ...TE_IRQHandler+282   pop     {r4, r5, r6, r7, pc}

In other IRQs this doesn't happen.

I understand that it can happen that more registers are pushed to the stack, so how can I be sure at which offset to retrieve the PC?

If I check the memory dump around the SP when the code is still in the IRQ handler, I can spot the return address, but it's always at a weird location, with a negative offset compared to the SP. I can't understand how to obtain the right address.

You can't rely on the stack pointer inside of the C handler because of two reasons:

Registers are always pushed to the active stack for the preempted code. Handlers always use the main stack (MSP). If the interrupt preempts thread-mode code that's running from the process stack (PSP) then the registers will be pushed to the PSP and you'll never find them in the handler stack;
The C routine will probably reserve some stack space for local variables, and you don't know how much that is, so you won't be able to locate the registers.

This is how I usually do it:

void WDT_IRQHandler_real(uint32_t *sp)
{
    /* PC is sp[6] (sp + 0x18) */
    /* ... your code ... */
}

/* Cortex M3/4 */
__attribute__((naked)) void WDT_IRQHandler()
{
    asm volatile (
        "TST   LR, #4\n\t"
        "ITE   EQ\n\t"
        "MRSEQ R0, MSP\n\t"
        "MRSNE R0, PSP\n\t"
        "LDR   R1, =WDT_IRQHandler_real\n\t"
        "BX    R1"
    );
}

/* Cortex M0/1 */
__attribute__((naked)) void WDT_IRQHandler()
{
    asm volatile (
        "MRS R0, MSP\n\t"
        "MOV R1, LR\n\t"
        "MOV R2, #4\n\t"
        "TST R1, R2\n\t"
        "BEQ WDT_IRQHandler_call_real\n\t"
        "MRS R0, PSP\n"
    "WDT_IRQHandler_call_real:\n\t"
        "LDR R1, =WDT_IRQHandler_real\n\t"
        "BX  R1"
    );
}

The trick here is that the handler is a small piece of assembly (I used a naked function with GCC asm, you can also use a separate asm file) that passes the stack pointer to the real handler. Here's how it works (for M3/4):

The initial value of LR in an exception handler is known as EXC_RETURN (more info here). Its bits have various meaning, we're interested in the fact that EXC_RETURN[2] is 0 if the active stack was the MSP and 1 if the active stack was the PSP;
TST LR, #4 checks EXC_RETURN[2] and sets condition flags;
MRSEQ R0, MSP moves the MSP into R0 if EXC_RETURN[2] == 0;
MRSNE R0, PSP moves the PSP into R0 if EXC_RETURN[2] == 1;
Finally, LDR/BX jumps to the real function (R0 is the first argument).

The M0/1 variant is similiar but uses branches since the core does not support IT blocks.

This solves the MSP/PSP issue and, since it runs before any compiler-generated stack operation, it will provide a reliable pointer. I used a simple (non-linked) branch to the function because I don't have to do anything after it and LR is already good to go. It saves a few cycles and an LR push/pop. Also all registers used are in the R0-R3 scratch range, so there's no need to preserve them.