I'm using a lot of 3D memoryviews in Cython, e.g.
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
I often want to loop over all elements of a. I can do this using a triple loop like
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        for k in range(a.shape[2]):
            a[i, j, k] = ...
If I do not care about the indices i, j and k, it is more efficient to do a flat loop, like
cython.declare(a_ptr='double*')
a_ptr = cython.address(a[0, 0, 0])
for i in range(size):
    a_ptr[i] = ...
Here I need to know the number of elements (size) in the array. This is given by the product of the elements in the shape attribute, i.e. size = a.shape[0]*a.shape[1]*a.shape[2], or more generally size = np.prod(np.asarray(a).shape). I find both of these ugly to write, and the (albeit small) computational overhead bothers me. The nice way to do it is to use the built-in size attribute of memoryviews, size = a.size. However, for reasons I cannot fathom, this leads to unoptimized C code, as is evident from the annotations HTML file generated by Cython. Specifically, the C code generated by size = a.shape[0]*a.shape[1]*a.shape[2] is simply
__pyx_v_size = (((__pyx_v_a.shape[0]) * (__pyx_v_a.shape[1])) * (__pyx_v_a.shape[2]));
where the C code generated from size = a.size is
__pyx_t_10 = __pyx_memoryview_fromslice(__pyx_v_a, 3, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_10);
__pyx_t_14 = __Pyx_PyObject_GetAttrStr(__pyx_t_10, __pyx_n_s_size); if (unlikely(!__pyx_t_14)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_14);
__Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
__pyx_t_7 = __Pyx_PyIndex_AsSsize_t(__pyx_t_14); if (unlikely((__pyx_t_7 == (Py_ssize_t)-1) && PyErr_Occurred())) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_14); __pyx_t_14 = 0;
__pyx_v_size = __pyx_t_7;
To generate the above code, I have enabled all possible optimizations through compiler directives, meaning that the unwieldy C code generated by a.size cannot be optimized away. It looks to me as though the size "attribute" is not really a pre-computed attribute, but actually carries out a computation upon lookup. Furthermore, this computation is quite a bit more involved than simply taking the product over the shape attribute. I cannot find any hint of an explanation in the docs.

What is the explanation of this behavior, and do I have a better choice than writing out a.shape[0]*a.shape[1]*a.shape[2], if I really care about this micro-optimization?
By looking at the produced C code, you can already see that size is a property and not a simple C member. The original Cython code for memory views (a sketch is shown below) makes it easy to see that the product is calculated only once and then cached. Clearly, this doesn't play a big role for 3-dimensional arrays, but for a higher number of dimensions caching could become pretty important (as we will see, there are at most 8 dimensions, so it is not that clear-cut whether this caching is really worth it).
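Roughly, the size property in Cython's MemoryView.pyx looks like this (paraphrased; the exact code may differ between Cython versions):

@property
def size(self):
    if self._size is None:
        result = 1
        for length in self.view.shape[:self.view.ndim]:
            result *= length
        self._size = result
    return self._size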
One can understand the decision to calculate the size lazily - after all, size is not always needed/used, and one doesn't want to pay for it. Clearly, there is a price to pay for this laziness if you use size a lot - that is the trade-off Cython makes.

I would not dwell too long on the overhead of calling a.size - it is nothing compared to the overhead of calling a Cython function from Python. For example, the measurements of @danny measure only this Python-call overhead and not the actual performance of the different approaches. To show this, I throw a third function into the mix, one which does double the amount of work but is just as fast; on the other hand, a function that does less work isn't faster either.
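A minimal sketch of such an experiment (the function names here are made up for illustration; timings will of course depend on the machine):

# compiled as Cython, e.g. in a %%cython cell
def do_both(double[:, :, ::1] a):
    # double the work: the Python-level size attribute AND the C shape product
    return a.size + a.shape[0] * a.shape[1] * a.shape[2]

def do_nothing(double[:, :, ::1] a):
    # essentially no work beyond accepting the memoryview argument
    return 0

# %timeit do_both(a), %timeit do_nothing(a) and the two single-method variants
# all end up in the same ballpark, because the Python-call overhead dominates.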
In a nutshell: I would use a.size because of the readability, assuming that optimizing it would not speed up my application, unless profiling proves otherwise.

The whole story: the variable a is of type __Pyx_memviewslice and not of type __pyx_memoryview, as one might think. The struct __Pyx_memviewslice has roughly the following definition (sketched from the generated C code; exact details can vary between Cython versions):
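/* sketch of the generated definition; the fixed array sizes come from
   Cython's maximum of 8 memoryview dimensions */
typedef struct {
  struct __pyx_memoryview_obj *memview;
  char *data;
  Py_ssize_t shape[8];
  Py_ssize_t strides[8];
  Py_ssize_t suboffsets[8];
} __Pyx_memviewslice;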
That means shape can be accessed very efficiently by the Cython code, as it is a simple C array (by the way, I asked myself what happens if there are more than 8 dimensions - the answer is: you cannot have more than 8 dimensions).

The member memview is where the memory is held, and __pyx_memoryview_obj is the C extension type that is produced from the Cython code we saw above. It looks roughly as follows (again sketched from the generated C code; exact members vary with the Cython version):
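/* abridged sketch of the generated extension type */
struct __pyx_memoryview_obj {
  PyObject_HEAD
  struct __pyx_vtabstruct_memoryview *__pyx_vtab;
  PyObject *obj;               /* the object owning the buffer       */
  PyObject *_size;             /* lazily computed, cached size       */
  PyObject *_array_interface;
  PyThread_type_lock lock;
  /* ... further bookkeeping members ... */
  Py_buffer view;
  int flags;
  int dtype_is_object;
  __Pyx_TypeInfo *typeinfo;
};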
So, __Pyx_memviewslice is not really a Python object - it is a kind of convenience wrapper which caches important data, like shape and strides, so this information can be accessed fast and cheaply.

What happens when we call a.size? First, __pyx_memoryview_fromslice is called, which does some additional reference counting and some further work and returns the member memview from the __Pyx_memviewslice object. Then the property size is looked up on this returned memoryview, which accesses the cached value in _size, as shown in the Cython code above.

It looks as if the Cython developers introduced a shortcut for such important information as shape, strides and suboffsets, but not for size, which is probably not as important - this is the reason for the cleaner C code in the case of shape.
The generated C code for a.size looks fine. It has to interface with Python because memory views are Python extension types: size on the memory view is a Python attribute and gets converted to ssize_t. That is all the C code does. The conversion can be avoided by typing the size variable as Py_ssize_t rather than ssize_t.

So there isn't anything in the C code that looks unoptimised - it's just looking up an attribute on a Python object, size on a memory view in this case.
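For example, in the pure-Python-mode style used in the question, that could look like (a minimal sketch):

cython.declare(size='Py_ssize_t')
size = a.size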
Here are the results of a micro-benchmark of the two methods.
Setup:
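A rough sketch of such a setup (the function names are made up for illustration):

# compiled as a Cython module, e.g. via a %%cython cell
def size_attribute(double[:, :, ::1] a):
    # Python-level attribute lookup on the memoryview
    return a.size

def shape_product(double[:, :, ::1] a):
    # pure-C product over the shape entries
    return a.shape[0] * a.shape[1] * a.shape[2]

# timed from Python, e.g.:
#   a = np.empty((10, 20, 30), dtype='double')
#   %timeit size_attribute(a)
#   %timeit shape_product(a)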
Results: Performance is pretty much identical.
The product method is purely C code, which matters if it needs to be executed in parallel, but otherwise there is no performance benefit over the memoryview's size.
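For instance, inside a nogil/prange region only the pure-C shape product is usable, since a.size is a Python attribute lookup that requires the GIL (a minimal sketch with illustrative names):

from cython.parallel import prange

def fill_parallel(double[:, :, ::1] a):
    cdef Py_ssize_t i, size
    cdef double *a_ptr = &a[0, 0, 0]
    size = a.shape[0] * a.shape[1] * a.shape[2]  # plain C arithmetic
    for i in prange(size, nogil=True):           # needs OpenMP compile flags
        a_ptr[i] = 0.0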