BigInteger numbers implementation and performance

I have written a BigInteger class in C++ that should be able to do operations on all numbers with any size. Currently I am trying to achieve a very fast multiplication method by comparing the existing algorithms and test for which amount of digits they work best and I ran into very unexpected results.I tried to do 20 multiplications of 500-digit and I timed them. This was the result:

karatsuba:
  14.178 seconds

long multiplication:
  0.879 seconds

Wikipedia told me

It follows that, for sufficiently large n, Karatsuba's algorithm will perform fewer shifts and single-digit additions than longhand multiplication, even though its basic step uses more additions and shifts than the straightforward formula. For small values of n, however, the extra shift and add operations may make it run slower than the longhand method. The point of positive return depends on the computer platform and context. As a rule of thumb, Karatsuba is usually faster when the multiplicands are longer than 320–640 bits.

Since my numbers are at least 1500 bits long this is quite unexpected because wikipedia said karatsuba should run faster. I believe that my problem might be in my addition algorithm but I don't see how I could make it faster because it's already running in O(n). Ill post my code below so that you can check my implementations. I'll leave the irrelevant parts out of it.
I was also thinking that maybe the structure I used was not the best possible. I represented each data segment in little endian. So for example if I have the number 123456789101112 stored into data segments of length 3 it would look like this:

{112,101,789,456,123}

so this is why I am asking now what the best structure and best way to implement a BigInteger class is? And why is the karatsuba algorithm slower than the long multiplication one ?

This is my code: (I'm sorry for the length)

using namespace std;

bool __longmult=true;
bool __karatsuba=false;

struct BigInt {
public:
    vector <int> digits;

    BigInt(const char * number) {
        //constructor is not relevant   
    }
    BigInt() {}

    void BigInt::operator = (BigInt a) {
        digits=a.digits;
    }

    friend BigInt operator + (BigInt,BigInt);
    friend BigInt operator * (BigInt,BigInt);

    friend ostream& operator << (ostream&,BigInt);
};

BigInt operator + (BigInt a,BigInt b) {
    if (b.digits.size()>a.digits.size()) {
        a.digits.swap(b.digits); //make sure a has more or equal amount of digits than b
    }
    int carry=0;

    for (unsigned int i=0;i<a.digits.size();i++) {
        int sum;
        if (i<b.digits.size()) {
            sum=b.digits[i]+a.digits[i]+carry;
        } else if (carry==1) {
            sum=a.digits[i]+carry;
        } else {
            break; // if carry is 0 and no more digits in b are left then we are done already
        }

        if (sum>=1000000000) {
            a.digits[i]=sum-1000000000;
            carry=1;
        } else {
            a.digits[i]=sum;
            carry=0;
        }
    }

    if (carry) {
        a.digits.push_back(1);
    }

    return a;
}

BigInt operator * (BigInt a,BigInt b) {
    if (__longmult) {
        BigInt res;
        for (unsigned int i=0;i<b.digits.size();i++) {
            BigInt temp;
            temp.digits.insert(temp.digits.end(),i,0); //shift to left for i 'digits'

            int carry=0;
            for (unsigned int j=0;j<a.digits.size();j++) {
                long long prod=b.digits[i];
                prod*=a.digits[j];
                prod+=carry;
                int t=prod%1000000000;
                temp.digits.push_back(t);
                carry=(prod-t)/1000000000;
            }
            if (carry>0) {
                temp.digits.push_back(carry);
            }
            res+=temp;
        }
        return res;
    } else if (__karatsuba) {
        BigInt res;
        BigInt a1,a0,b1,b0;
        assert(a.digits.size()>0 && b.digits.size()>0);
        while (a.digits.size()!=b.digits.size()) { //add zeroes for equal size
            if (a.digits.size()>b.digits.size()) {
                b.digits.push_back(0);
            } else {
                a.digits.push_back(0);
            }
        }

        if (a.digits.size()==1) {
            long long prod=a.digits[0];
            prod*=b.digits[0];

            res=prod;//conversion from long long to BigInt runs in constant time
            return res;

        } else {
            for (unsigned int i=0;i<a.digits.size();i++) {
                if (i<(a.digits.size()+(a.digits.size()&1))/2) { //split the number in 2 equal parts
                    a0.digits.push_back(a.digits[i]);
                    b0.digits.push_back(b.digits[i]);
                } else {
                    a1.digits.push_back(a.digits[i]);
                    b1.digits.push_back(b.digits[i]);
                }
            }
        }

        BigInt z2=a1*b1;
        BigInt z0=a0*b0;
        BigInt z1 = (a1 + a0)*(b1 + b0) - z2 - z0;

        if (z2==0 && z1==0) {
            res=z0;
        } else if (z2==0) {
            z1.digits.insert(z1.digits.begin(),a0.digits.size(),0);
            res=z1+z0;
        } else {
            z1.digits.insert(z1.digits.begin(),a0.digits.size(),0);
            z2.digits.insert(z2.digits.begin(),2*a0.digits.size(),0);
            res=z2+z1+z0;
        }

        return res;
    }
}

int main() {
    clock_t start, end;

    BigInt a("984561231354629875468546546534125215534125215634987498548489456125215421563498749854848945612385663498749854848945612521542156349874985484894561238561698774565123165221393856169877456512316552156349874985484894561238561698774565123165221392213935215634987498548489456123856169877456512316522139521563498749854848945612385616987745651231652213949651465123151354686324848945612385616987745651231652213949651465123151354684132319321005482265341252156349874985484894561252154215634987498548489456123856264596162131");
    BigInt b("453412521563498749853412521563498749854848945612521542156349874985484894561238565484894561252154215634987498548489456123856848945612385616935462987546854521563498749854848945653412521563498749854848945612521542156349874985484894561238561238754579785616987745651231652213965465341235215634987495215634987498548489456123856169877456512316522139854848774565123165223546298754685465465341235215634987498548354629875468546546534123521563498749854844139496514651231513546298754685465465341235215634987498548435468");

    __longmult=false;
    __karatsuba=true;

    start=clock();
    for (int i=0;i<20;i++) {
        a*b;
    }
    end=clock();
    printf("\nTook %f seconds\n", (double)(end-start)/CLOCKS_PER_SEC);

    __longmult=true;
    __karatsuba=false;

    start=clock();
    for (int i=0;i<20;i++) {
        a*b;
    }
    end=clock();
    printf("\nTook %f seconds\n", (double)(end-start)/CLOCKS_PER_SEC);

    return 0;
}

You use std::vector

for your digits make sure there are no unnecessary reallocations in it. So allocate space before operation to avoid it. Also I do not use it so I do not know the array range checking slowdowns.

Check if you do not shift it !!! which is O(N) ... i.e. insert to first position...

Optimize your implementation

here you can find mine implementation optimized an unoptimized for comparison

x=0.98765588997654321000000009876... | 98*32 bits...
mul1[ 363.472 ms ]... O(N^2) classic multiplication
mul2[ 349.384 ms ]... O(3*(N^log2(3))) optimized karatsuba multiplication
mul3[ 9345.127 ms]... O(3*(N^log2(3))) unoptimized karatsuba multiplication

mine implementation threshold for Karatsuba is around 3100bits ... ~ 944 digits!!! The more optimized the code the lover threshold is.

Try to remove unnecessary data from function operands

//BigInt operator + (BigInt a,BigInt b)
BigInt operator + (const BigInt &a,const BigInt &b)

this is way you will not create another copy of a,b on heap in every + call also even faster is this:

mul(BigInt &ab,const BigInt &a,const BigInt &b) // ab = a*b

Schönhage-Strassen multiplication

this one is FFT or NTT based. Mine threshold for it is big ... ~ 49700bits ... ~ 15000digits so if you do not plan to use such big numbers then forget about it. Implementation is also in the link above.

here is mine NTT implementation (optimized as much as I could)
Summary

Does not matter if you use little or big endian but you should code your operations in a way that they do not use insert operations.

You use decadic base for digits that is slow because you need to use division and modulo operations. If you choose base as power of 2 then just bit operations are enough and also it removes many if statements from code which are slowing all the most. If you need the base as power of 10 then use biggest you can in some cases this reduce the div,mod to few subtractions
```
2^32 = 4 294 967 296 ... int = +/- 2147483648
base = 1 000 000 000

//x%=base
while (x>=base) x-=base;
```
max number of cycles is 2^32/base or 2^31/base on some platform is this faster then modulo and also the bigger the base the less operations you need but beware the overflows !!!