Final subtraction in montgomery modular multiplica

I'm confused about how one might supposedly bypass the final subtraction of the modulus in radix-2 montgomery modular multiplication, when used in a modular exponentiation algorithm. The following two papers put forward the conditions for bypassing the subtraction.

Montgomery Exponentiation with no Final Subtractions: Improved Results
Montgomery Multiplication Needs no Final Subtractions

I don't understand exactly what is required in terms of the "preprocessing and postprocessing" to eliminate the need for the repetitive subtraction of the modulus at the end of the montgomery multiplication.

My understanding after reading the above papers is that, to eliminate the final subtractions, you must:

Zero-extend each input operand to the modular exponentiation by two
```
e.g. new[2049 downto 0]  = (0, 0, old[2047 downto 0]) 
```
Increase the loop bound inside the Montgomery multiplications by two, such that two more iterations of the loop are executed

I've made these modifications to a working algorithm, however the results are not as I expect and I do not understand why. Therefore, I assume I am misinterpreting something in these papers, or leaving out a critical step.

Let us refer to my (working) radix-2 montgomery modular exponentiation function in (C-like pseudocode). Note that I have extended the operand width by two most-significant zero digits (just to make sure I'm not overflowing). They used to only be 2048 bits.

let NUM_BITS = 2048
let rsaSize_t be a 2050-bit vector type

// Montgomery multiplication: outData = XYr^(-1) modulo M,     
// where the radix r=2^n    (n=NUM_BITS) 
function montMult( rsaSize_t X,       // Multiplier
                   rsaSize_t Y,       // Multiplicand
                   rsaSize_t M,       // Modulus
                   rsaSize_t outData) // Result
{
    rsaSize_t S = 0;  // Running sum

    for (i=0; i<NUM_BITS; i++)
    {
        if (X.bit(i)==1) // Check ith bit of X
            S += Y;

        if (S.bit(0)==1) // check LSB of S
            S += M;

        S = S >> 1;   // Rightshift 1 bit
    }

    // HERE IS THE FINAL SUBTRACTION I WANT (NEED) TO AVOID
    if (S >= M)
    {
        S -= M;
    }

    outData = S.range(NUM_BITS-1,0);
}


//  montgomery modular exponentiation using square and multiply algorithm
//  computes  M^e modulo n, where we precompute the transformation of the 
//  base and running-partial sum into the montgomery domain 
function rsaModExp( rsaSize_t e,     // exponent 
                    rsaSize_t n,     // modulus
                    rsaSize_t Mbar,  // precomputed: montgomery residue of the base w.r.t. the radix--> (2^2048)*base mod n 
                    rsaSize_t xbar,  // precomputed: montgomery residue of 1  w.r.t. the radix--> 2^2048 mod n                 
                    rsaSize_t *out)  // result
{
    for (i=NUM_BITS-1; i>=0; i--)
    {
        montMult(xbar, xbar, n, xbar); // square
        if (e.bit(i)==1) 
            montMult(Mbar, xbar, n, xbar); // multiply
    }

    // undo montgomery transform
    montMult(xbar, 1, n, out);
}

Am I missing something in the papers? I do not believe this is an implementation error, as my code matches exactly what is put forth in the papers. I believe that I might be a conceptual error. Any and all help appreciated.

Thanks!

Not sure what's wrong with your non-working implementation (if I understood well, what you show is a working one). In order to use the Walter optimization to compute M^e mod n, if your numbers all fit in 2048 bits, you need:

let NUM_BITS = 2050            // 2048 + 2
n < 2^(NUM_BITS - 2)           // precondition
M < 2 * n                      // precondition
let R = 2^(2 * NUM_BITS) mod n // pre-computed once for all
let M' = montMult(M, R, n)     // bring M in Montgomery form
let C' = montExp(M', e, n)     // Montgomery exponentiation
let C = montMult(C', 1, n)     // bring C' in normal form

Most important: do not forget to check the preconditions.

The Montgomery multiplications comprise NUM_BITS (2050 in your case) iterations (if-A-bit-set-add-B, if-odd-add-n, div-by-two), least significant bit first, and all numbers are represented on NUM_BITS (2050 in your case) bits.

The Montgomery exponentiation also comprises NUM_BITS (2050 in your case) iterations (square, if-e-bit-set-mult), most significant bit first, and all numbers are represented on NUM_BITS (2050 in your case) bits. Hope it helps.