Interpretation vs dynamic dispatch penalty in Python

Posted 2020-01-31 11:50

Question:

I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".

Brandon mentions at 09:30 that for a specific short piece of code, skipping interpretation gave 40% speedup, while skipping the allocation and dispatch gave 574% speedup (10:10).

My question is: how is this measured for a specific piece of code? Does one need to manually extract the underlying C calls and then somehow make the runtime run them?

This is a very interesting observation, but how do I recreate the experiment?

Answer 1:

Let's take a look at this Python function:

def py_fun(i, N, step):
    res = 0.0
    while i < N:
        res += i
        i += step
    return res

and use the IPython %timeit magic to time it:

In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop
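
To see what the interpreter actually has to work through on each iteration, we can disassemble the function with the standard dis module (just a quick check, nothing Cython-specific):

import dis

# print the bytecode that CPython's eval loop walks through for py_fun;
# every add, compare and jump below is dispatched by the interpreter
dis.dis(py_fun)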

The interpreter runs through the resulting bytecode and interprets it, one instruction at a time. However, we can cut out the interpreter by cythonizing the very same code:

%load_ext Cython

%%cython
def cy_fun(i, N, step):
    res = 0.0
    while i < N:
        res += i
        i += step
    return res

This cuts the running time by more than half:

In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop

When we look into the produced C code, we see that the right C-API functions are called directly, without the bytecode being interpreted, i.e. without going through ceval. Here is the loop after stripping away the boilerplate code:

static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
  ...
  while (1) {
    __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT); 
    ...
    __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
    ...
    if (!__pyx_t_2) break;
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step); 
  }
  ...
  return __pyx_r;
}
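
To recreate this inspection yourself (one possible workflow, not the only one): in a notebook, the --annotate flag of the %%cython magic produces an HTML view that links every Python line to the C code generated for it:

%%cython --annotate
# same function as above; the annotation view shows which C-API calls
# (PyObject_RichCompare, PyNumber_InPlaceAdd, ...) each line expands to
def cy_fun(i, N, step):
    res = 0.0
    while i < N:
        res += i
        i += step
    return res

Outside a notebook, running the cython command line tool on a .pyx file (the file name here is just an example), e.g. cython -3 cy_fun.pyx, writes the full cy_fun.c next to the source.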

However, this Cython function still handles Python objects rather than C-style floats, so inside a function like PyNumber_InPlaceAdd it is necessary to figure out what these objects really are (integer, float, something else?) and to dispatch the call to the right function that does the job.
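
As a pure-Python illustration of what this dispatch means (only an analogy at the Python level; the real decision happens in the C type slots):

import operator

# the same in-place add has to inspect the runtime types of both operands
# and route to a different implementation each time
print(operator.iadd(1, 2))       # int + int
print(operator.iadd(1.5, 2))     # float + int, the int gets coerced
print(operator.iadd([1], [2]))   # list.__iadd__, something else entirely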

With the help of Cython we can also eliminate the need for this dispatch and call the addition for doubles directly:

%%cython
def c_fun(double i, double N, double step):
    cdef double res = 0.0
    while i < N:
        res += i
        i += step
    return res

In this version, i, N, step and res are C-style doubles and no longer Python objects. So there is no longer a need to call dispatch functions like PyNumber_InPlaceAdd; the +-operator for double can be used directly:

static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
  ...
  __pyx_v_res = 0.0;  
  ... 
  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_res = (__pyx_v_res + __pyx_v_i);
    __pyx_v_i = (__pyx_v_i + __pyx_v_step);
  }
  ...
  return __pyx_r;
}

And the result is:

In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop

Now, this is a speed-up of almost a factor of 100 (10.9 ms vs. 148 µs) compared to the version without the interpreter but with dispatch.

Actually, saying that dispatch + allocation is the bottleneck here (because eliminating it led to a speed-up of almost a factor of 100) would be a fallacy: the interpreter is responsible for more than 50% of the running time (about 15 ms), while dispatch and allocation account "only" for about 10 ms.
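
A quick back-of-the-envelope check with the timings above (these numbers are from my runs; yours will differ):

total       = 25.4e-3   # py_fun: interpreter + dispatch/allocation + arithmetic
no_interp   = 10.9e-3   # cy_fun: dispatch/allocation + arithmetic
no_dispatch = 148e-6    # c_fun:  essentially only the arithmetic

print(f"interpreter:           {(total - no_interp) / total:.0%}")        # ~57%
print(f"dispatch + allocation: {(no_interp - no_dispatch) / total:.0%}")  # ~42%
print(f"rest:                  {no_dispatch / total:.1%}")                # ~0.6%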


However, there are more performance problems than just the interpreter and dynamic dispatch: a Python float is immutable, so every time its value changes a new object must be created and registered/unregistered with the garbage collector.

We can introduce a mutable float, which is changed in place and doesn't need this registering/unregistering:

%%cython
cdef class MutableFloat:
    cdef double x
    def __cinit__(self, x):
        self.x = x
    def __iadd__(self, MutableFloat other):
        self.x = self.x + other.x
        return self
    def __lt__(MutableFloat self, MutableFloat other):
        return self.x < other.x
    def __gt__(MutableFloat self, MutableFloat other):
        return self.x > other.x
    def __repr__(self):
        return str(self.x)
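
A quick sanity check that += really changes the object in place instead of allocating a new one (assuming the class above has been compiled in the same session):

x = MutableFloat(1.0)
y = x                       # second reference to the same object
x += MutableFloat(2.0)
print(x, y)                 # 3.0 3.0 -> mutated in place, no new object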

The timings, now for a four-argument variant that accumulates into acc (I'm using a different machine, so the numbers are a little different):

def py_fun(i, N, step, acc):
    while i < N:
        acc += i
        i += step
    return acc
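
cy_fun in the timings below is assumed to be nothing more than the cythonized counterpart of this four-argument version, i.e. something like:

%%cython
def cy_fun(i, N, step, acc):
    while i < N:
        acc += i
        i += step
    return acc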

%timeit py_fun(1.0, 5e5, 1.0, 0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit cy_fun(1.0, 5e5, 1.0, 0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit i,N,step,acc = MutableFloat(1.0), MutableFloat(5e5), MutableFloat(1.0), MutableFloat(0.0); py_fun(i, N, step, acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit i,N,step,acc = MutableFloat(1.0), MutableFloat(5e5), MutableFloat(1.0), MutableFloat(0.0); cy_fun(i, N, step, acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Don't forget to reinitialize i between runs because it is mutable! The results:

            immutable       mutable
 py_fun       30ms           23ms
 cy_fun       17ms           11ms

So up to 7 ms (about 20%) are needed for registering/unregistering the floats in the version with the interpreter (I'm not sure something else isn't also playing a role here), and more than 33% in the version without the interpreter.

As it looks now:

  • about 40% (13/30) of the time is used by the interpreter
  • up to 33% of the time is used for the dynamic dispatch
  • up to 20% of the time is used for creating/deleting temporary objects
  • about 1% for the arithmetic operations themselves

Another problem is data locality, which becomes obvious for memory-bandwidth-bound problems: modern caches work well if the data is processed linearly, one consecutive memory address after another. This is true when looping over a std::vector<> (or an array.array), but not when looping over a Python list, because such a list consists of pointers which can point to any place in memory.

Consider the following python scripts:

#list.py
N = int(1e7)
lst = [0]*N
for i in range(N):
    lst[i] = i
print(sum(lst))

and

#bytearray.py
N = int(1e7)
b = bytearray(8*N)
m = memoryview(b).cast('L')  # reinterpret as an array of unsigned longs
for i in range(N):
    m[i] = i
print(sum(m))

Both scripts create 1e7 integers: the first version Python integers, the second lowly C ints which are placed contiguously in memory.

The interesting part is how many L1 data cache misses (D1) these scripts produce:

valgrind --tool=cachegrind python list.py 
...
D1  misses:        33,964,276  (   27,473,138 rd   +     6,491,138 wr)

versus

valgrind --tool=cachegrind python bytearray.py 
...
D1  misses:         4,796,626  (    2,140,357 rd   +     2,656,269 wr)

That means about 8 times more cache misses for the Python integers. Part of it is due to the fact that Python integers need more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4), and part of it (maybe; I'm not 100% sure, because neighboring integers are created one after another, so the chances are high that they end up next to each other somewhere in memory; further investigation needed) is due to the fact that they aren't laid out contiguously in memory the way the C integers of the bytearray are.
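
The per-object overhead is easy to check (CPython on a 64-bit machine; the exact number depends on the build):

import sys

print(sys.getsizeof(10**8))   # 28 bytes for a small Python int on 64-bit CPython,
                              # plus allocator overhead, versus the flat 8 bytes
                              # per element of the 'L'-cast memoryview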