Suppose we take np.dot of two 'float32' 2D arrays:
res = np.dot(a, b) # see CASE 1
print(list(res[0])) # list shows more digits
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
Numbers. Except, they can change:
CASE 1: slice a
np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')
for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0]))  # full shape: (i, 6)
[-0.9044868, -1.1708502, 0.90713596, 3.5594249, 1.1374012, -1.3826287]
[-0.90448684, -1.1708503, 0.9071359, 3.5594249, 1.1374011, -1.3826288]
[-0.90448684, -1.1708503, 0.9071359, 3.5594249, 1.1374011, -1.3826288]
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
[-0.90448684, -1.1708503, 0.907136, 3.5594249, 1.1374011, -1.3826287]
The results differ, even though the printed row is computed from exactly the same numbers each time.
CASE 2: flatten a, take a 1D version of b, then slice a:
np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(1, 6).astype('float32')
for i in range(1, len(a)):
    a_flat = np.expand_dims(a[:i].flatten(), -1)  # keep 2D
    print(list(np.dot(a_flat, b)[0]))  # full shape: (i*6, 6)
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
[-0.3393164, 0.9528787, 1.3627989, 1.5124314, 0.46389243, 1.437775]
CASE 3: for stronger control, set all non-involved entries to zero: add a[1:] = 0 to the CASE 1 code. Result: the discrepancies persist.
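For reference, my reading of the CASE 3 setup - CASE 1's code with the zeroing line added (the exact placement of a[1:] = 0 is my assumption):

```python
import numpy as np

np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')
a[1:] = 0  # zero every row of `a` not involved in computing res[0]

for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0]))
</antml>```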
CASE 4: check indices other than [0]; as for [0], results begin to stabilize a fixed # of array enlargements from their point of creation. Output
np.random.seed(1)
a = np.random.randn(9, 6).astype('float32')
b = np.random.randn(6, 6).astype('float32')
for j in range(len(a) - 2):
    for i in range(1, len(a)):
        res = np.dot(a[:i], b)
        try: print(list(res[j]))
        except IndexError: pass
    print()
Hence, for the 2D * 2D case, results differ - but are consistent for the 1D * 1D case. From some of my reading, this appears to stem from 1D-1D using simple addition, whereas 2D-2D uses 'fancier', performance-boosting addition that may be less precise (e.g. pairwise addition does the opposite). Nonetheless, I'm unable to understand why the discrepancies vanish in CASE 1 once a is sliced past a set 'threshold'; the larger a and b, the later this threshold seems to lie, but it always exists.
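That summation order alone can change a float32 result is easy to show with a minimal standalone example (this illustrates the general phenomenon, not np.dot's actual internals):

```python
import numpy as np

x = np.float32([1e8, 1.0, -1e8])

# Left-to-right: 1e8 + 1 rounds back to 1e8 (float32 spacing at 1e8 is 8),
# so the 1.0 is lost before -1e8 cancels.
left_to_right = (x[0] + x[1]) + x[2]

# Regrouped: the large terms cancel first, so the 1.0 survives.
regrouped = (x[0] + x[2]) + x[1]

print(left_to_right, regrouped)  # 0.0 1.0
</antml>```

Floating-point addition is not associative, so any library that regroups or blocks the accumulation for speed can legitimately return a (slightly) different answer.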
All said: why is np.dot imprecise (and inconsistent) for ND-ND arrays? Relevant Git
Additional info:
- Environment: Win-10 OS, Python 3.7.4, Spyder 3.3.6 IDE, Anaconda 3.0 2019/10
- CPU: i7-7700HQ 2.8 GHz
- Numpy v1.16.5
Possible culprit library: Numpy MKL - also BLAS libraries; thanks to Bi Rico for noting
Stress-test code: as noted, the discrepancies grow in frequency with larger arrays; if the above isn't reproducible, the below should be (if not, try larger dims). My output
np.random.seed(1)
a = (0.01*np.random.randn(9, 9999)).astype('float32') # first multiply then type-cast
b = (0.01*np.random.randn(9999, 6)).astype('float32') # *0.01 to bound mults to < 1
for i in range(1, len(a)):
    print(list(np.dot(a[:i], b)[0]))
Problem severity: the discrepancies shown are 'small', but no longer so when operating on a neural network with billions of numbers multiplied over a few seconds, and trillions over the entire runtime; reported model accuracy can differ by tens of percent, per this thread.
Below is a gif of arrays resulting from feeding to a model what's basically a[0], w/ len(a)==1 vs. len(a)==32:
OTHER PLATFORMS: results, with thanks to Paul's testing:
Case 1 reproduced (partly):
- Google Colab VM -- Intel Xeon 2.3 GHz -- Jupyter -- Python 3.6.8
- Win-10 Pro Docker Desktop -- Intel i7-8700K -- jupyter/scipy-notebook -- Python 3.7.3
- Ubuntu 18.04.2 LTS + Docker -- AMD FX-8150 -- jupyter/scipy-notebook -- Python 3.7.3
Note: these yield a much lower error than shown above; two entries in the first row are off by 1 in the least significant digit from the corresponding entries in the other rows.
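For scale: an off-by-1 in the last printed digit of a float32 corresponds to roughly one ULP (unit in the last place), whose size at the magnitudes printed above can be checked with np.spacing:

```python
import numpy as np

# ULP size at the magnitudes appearing in CASE 1's printed output
for v in [0.90448684, 1.1708503, 3.5594249]:
    print(v, float(np.spacing(np.float32(v))))  # ~6e-8 to ~2.4e-7
</antml>```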
Case 1 not reproduced:
- Ubuntu 18.04.3 LTS -- Intel i7-8700K -- IPython 5.5.0 -- Python 2.7.15+ and 3.6.8 (2 tests)
- Ubuntu 18.04.3 LTS -- Intel i5-3320M -- IPython 5.5.0 -- Python 2.7.15+
- Ubuntu 18.04.2 LTS -- AMD FX-8150 -- IPython 5.5.0 -- Python 2.7.15rc1
Notes:
- The linked Colab notebook and Jupyter environments show a far smaller discrepancy (and only for the first two rows) than is observed on my system. Also, CASE 2 never (yet) showed imprecision.
- Within this very limited sample, the current (Dockerized) Jupyter environment is more susceptible than the IPython environment.
np.show_config() output is too long to post, but in summary: the IPython envs are BLAS/LAPACK-based; Colab is OpenBLAS-based. In the IPython Linux envs, the BLAS libraries are system-installed -- in Jupyter and Colab, they come from /opt/conda/lib
UPDATE: the accepted answer is accurate, but broad and incomplete. The question remains open for anyone who can explain the behavior at the code level - namely, the exact algorithm used by np.dot, and how it explains the 'consistent inconsistencies' observed in the above results (also see comments). Here are some direct implementations beyond my deciphering: sdot.c -- arraytypes.c.src