I am investigating caching behavior in different languages. I create two matrices in Python using lists (yes, I know it's a linked list, but bear with me here). I then multiply these matrices together in three ways:
def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    en = int(block * n / block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    sum = c[i][j]
                    for k in range(kk, kk + block):
                        sum += a[i][k] * b[k][j]
                    c[i][j] = sum
    return c
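(zero_matrix isn't shown above; presumably it just builds an n-by-n nested list of zeros, something like this minimal sketch:)

def zero_matrix(n):
    # build an n x n matrix of zeros as a list of lists
    return [[0] * n for _ in range(n)]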
My timings are as follows:

Baseline:      3.440004294627216
Flip j and k:  3.4685347505603144  (100.83% of baseline)
Blocking:      2.729924394035205   (79.36% of baseline)
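(For reference, timings like these can be collected with a harness along the following lines; the matrix size of 400 and the use of time.perf_counter are assumptions, since the original harness isn't shown:)

import random
import time

n = 400  # assumed size; must be a multiple of the block size used above
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

for f in (baseline_matrix_multiply,
          baseline_matrix_multiply_flipjk,
          fast_matrix_multiply_blocking):
    start = time.perf_counter()
    f(a, b, n)
    print(f.__name__, time.perf_counter() - start)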
Some things to note:
I am familiar with CPU caching behavior. To see my experiment in C, see here, though I haven't gotten any reviews for it.
I've done this in JavaScript and C#, and the flip-j-k function provides significant performance gains when using arrays (the JS was run in the Chrome browser).
The Python implementation is Python 3.5, by way of Anaconda.
Please don't tell me about NumPy. My experiment is not about absolute performance but rather about understanding caching behavior.
Question: Does anyone know what is going on here? Why does flipping j and k not provide a speedup? Is it because the list is a linked list? But then why does blocking provide a non-marginal improvement in performance?
The latter version (block multiplication) is only about 20% faster, and that is because you save index accesses by using a local variable (sum) in the inner loop, not because of better cache behavior!
(And FYI, this is not C++: the Python list type is like a C++ vector, otherwise people would waste their time trying to access elements by index. So it is the fastest standard random-access container available in Python.)
I just made a test program based on your code to prove my point and confirm what I was suspecting. (I had to complete your code; sorry for the big block, but at least it is minimal, complete & verifiable for all.)
Results on my PC (the last results, surrounded with stars, are my versions):
As you can see, my version of your baseline_matrix_multiply_flipjk (the fourth one) is faster than even the block multiply, meaning that index checking & accessing dwarfs the cache effects you can get in compiled languages with direct pointers, like C or C++. I just stored the values that were not changing in the inner loop (the one most worth optimizing) to avoid index accesses.
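Roughly, the hoisting looks like this (a minimal sketch; the function name and exact code here are illustrative, not the exact version I timed):

def baseline_matrix_multiply_flipjk_faster(a, b, n):
    '''
    same as flipjk, but hoist everything that does not change in the inner loop
    '''
    c = zero_matrix(n)
    for i in range(n):
        ci = c[i]          # result row, fetched once per i
        ai = a[i]          # row of a, fetched once per i
        for k in range(n):
            aik = ai[k]    # scalar, constant over the whole j loop
            bk = b[k]      # row of b, fetched once per k
            for j in range(n):
                ci[j] += aik * bk[j]
    return c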
Note that I tried to apply the same recipe to the block multiply and I gained some time compared to your version, but it is still beaten by the flipjk_faster version because of the inability to avoid all the index accesses.
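For reference, applying the same hoisting to the block version might look something like this (again a sketch, not the exact code that was timed):

def fast_matrix_multiply_blocking_faster(a, b, n):
    '''
    same blocking as above, but hoist the rows and use a local accumulator
    '''
    c = zero_matrix(n)
    block = 25
    en = int(block * n / block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                ci = c[i]      # result row, fetched once per i
                ai = a[i]      # row of a, fetched once per i
                for j in range(jj, jj + block):
                    total = ci[j]
                    for k in range(kk, kk + block):
                        total += ai[k] * b[k][j]   # b[k][j] still needs two index accesses
                    ci[j] = total
    return c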
Maybe compiling the code using Cython and dropping the checks would get the result you want. But pre-computing indexes never hurts.
Python does not cache the results of its functions by default. You have to state explicitly when you want a cache to be built for a function. You can do this using the lru_cache decorator.
The following is code I threw together the other day, and I've just added some readability. If there's something wrong, comment and I'll sort it out:
BTW: "#" indicates a comment, in case you're unfamiliar with Python.
So try running this, and then try it AGAIN without the "#" on the line
#@lru_cache(maxsize=4, typed=False)
Additionally, maxsize is the maximum size of the cache (it works best with powers of two, according to the docs), and typed makes the cache store separate entries when the same value is passed as an argument with a different type (e.g. 3 vs 3.0).
Finally, the "Func" function is known as Ackermann's function. A stupidly deep amount of recursion goes on so be aware of stackoverflows (lol) if you should change the max recursion depth.