I am investigating caching behavior in different languages. I create two matrices in Python using lists (yes, I know it's a linked list, but bear with me here). I then multiply these matrices together in three ways:
def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    en = int(block * n / block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    sum = c[i][j]
                    for k in range(kk, kk + block):
                        sum += a[i][k] * b[k][j]
                    c[i][j] = sum
    return c
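(zero_matrix isn't shown above; presumably it just builds an n-by-n nested list of zeros, something like this minimal sketch:)

def zero_matrix(n):
    # build an n x n matrix of zeros as a list of lists
    return [[0] * n for _ in range(n)]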
My timings are as follows:

Baseline:      3.440004294627216
Flip j and k:  3.4685347505603144  (100.83% of baseline)
Blocking:      2.729924394035205   (79.36% of baseline)
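(For reference, timings like these can be collected with a harness along the following lines; the matrix size of 400 and the use of time.perf_counter are assumptions, since the original harness isn't shown:)

import random
import time

n = 400  # assumed size; must be a multiple of the block size used above
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

for f in (baseline_matrix_multiply,
          baseline_matrix_multiply_flipjk,
          fast_matrix_multiply_blocking):
    start = time.perf_counter()
    f(a, b, n)
    print(f.__name__, time.perf_counter() - start)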
Some things to note:
I am familiar with CPU caching behavior. To see my experiment in C, see here, though I haven't gotten any reviews for it.
I've done this in JavaScript and C#, and the flip-j-k function provides significant performance gains when using arrays (the JS was run in the Chrome browser).
The Python implementation is Python 3.5, by way of Anaconda.
Please don't tell me about NumPy. My experiment is not about absolute performance but rather about understanding caching behavior.
Question: Does anyone know what is going on here? Why does flipping j and k not provide a speedup? Is it because the list is a linked list? But then why does blocking provide a non-marginal improvement in performance?
The latter version (block multiplication) is only about 20% faster, and that is because you save index accesses by using a local variable (sum) in the inner loop, not because of better cache behavior!
(And FYI, this is not C++: the Python list type is like a C++ vector, otherwise people would waste their time trying to access elements by index. So it is the fastest standard random-access container available in Python.)
I just made a test program based on your code to prove my point and confirm what I was suspecting. (I had to complete your code; sorry for the big block, but at least it is minimal, complete & verifiable for all.)
Results on my PC (the last results, surrounded with stars, are my versions):
As you can see, my version of your baseline_matrix_multiply_flipjk (the fourth one) is faster than even the block multiply, meaning that index checking & accessing dwarfs the cache effects you can get in compiled languages with direct pointers, like C or C++. I just stored the values that were not changing in the inner loop (the one most worth optimizing) to avoid index accesses.
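Roughly, the hoisting looks like this (a minimal sketch; the function name and exact code here are illustrative, not the exact version I timed):

def baseline_matrix_multiply_flipjk_faster(a, b, n):
    '''
    same as flipjk, but hoist everything that does not change in the inner loop
    '''
    c = zero_matrix(n)
    for i in range(n):
        ci = c[i]          # result row, fetched once per i
        ai = a[i]          # row of a, fetched once per i
        for k in range(n):
            aik = ai[k]    # scalar, constant over the whole j loop
            bk = b[k]      # row of b, fetched once per k
            for j in range(n):
                ci[j] += aik * bk[j]
    return c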
Note that I tried to apply the same recipe to the block multiply and I gained some time compared to your version, but it is still beaten by the flipjk_faster version because of the inability to avoid all the index accesses.
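For reference, applying the same hoisting to the block version might look something like this (again a sketch, not the exact code that was timed):

def fast_matrix_multiply_blocking_faster(a, b, n):
    '''
    same blocking as above, but hoist the rows and use a local accumulator
    '''
    c = zero_matrix(n)
    block = 25
    en = int(block * n / block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                ci = c[i]      # result row, fetched once per i
                ai = a[i]      # row of a, fetched once per i
                for j in range(jj, jj + block):
                    total = ci[j]
                    for k in range(kk, kk + block):
                        total += ai[k] * b[k][j]   # b[k][j] still needs two index accesses
                    ci[j] = total
    return c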
Maybe compiling the code using Cython and dropping the checks would get the result you want. But pre-computing indexes never hurts.
Python does not cache the results of its functions by default. You have to state explicitly when you want a cache to be built for a function. You can do this using the lru_cache decorator.
The following is code I threw together the other day, and I've just added some readability. If there's something wrong, comment and I'll sort it out:
BTW: "#" indicates a comment, in case you're unfamiliar with Python.
So try running this, and then try it AGAIN without the "#" on the line
#@lru_cache(maxsize=4, typed=False)
Additionally, maxsize is the maximum size of the cache (it works best with powers of two, according to the docs), and typed makes the cache store separate entries when the same value is passed as an argument with a different type (e.g. 3 vs 3.0).
Finally, the "Func" function is known as Ackermann's function. A stupidly deep amount of recursion goes on so be aware of stackoverflows (lol) if you should change the max recursion depth.