I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.
```cython
def pack(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
```
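For reference, the same loop as plain Python (my addition, not from the question); note that it assumes the payload sits in the high nibble of each input byte, since x//16 extracts the high nibble:

```python
def pack_py(it):
    # pure-Python equivalent of the Cython list comprehension above
    n = len(it) // 2
    return [(it[i*2] // 16) << 4 | it[i*2 + 1] // 16 for i in range(n)]
```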
While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.
```shell
cython3 -a pack.pyx
```

generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.
From a C point of view, the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could speed this up further, proportionally.

The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?

Using array as suggested in "How to convert python array to cython array?" does not seem to make it faster (and the array needs to be created from a bytes object beforehand); nor does typing the parameter as list instead of object (same as no type), or using a for loop instead of a list comprehension:
```cython
def packx(list it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef list r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
    return r
```
I think my earlier test just specified an array.array as input, but following the comments I've now tried:
```cython
from cpython cimport array
import array

def packa(array.array a):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    cdef unsigned int b[256*64/2]
    for i in range(n):
        b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16
    cdef array.array c = array.array("B", b)
    return c
```
which compiles, but

```python
ima = array.array("B", imd)  # unsigned char (1 byte)
pa = packa(ima)
packed = pa.tolist()
```

segfaults. I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for the output data are appreciated.
Taking @ead's first approach, plus combining division and shifting (which seems to save a microsecond):
```cython
#cython: boundscheck=False, wraparound=False
from cpython cimport array
import array

def packa(char[::1] a):
    """Cythonize python nibble packing loop, typed with array"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    # cdef unsigned int b[256*64/2]
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4)
    return res
```
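As a sanity check (my addition): for unsigned bytes the mask-and-shift form is exactly equivalent to the division form, since (x//16)<<4 == x & 0xF0 and x//16 == x >> 4:

```python
# verify the two nibble-extraction forms agree for all byte values
for x in range(256):
    assert (x // 16) << 4 == x & 0xF0
    assert x // 16 == x >> 4
```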
takes much longer to compile, but runs much faster:
```shell
python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop
```
Amazing! But, with the additional bytes-to-array and array-to-list conversion

```python
ima = array.array("B", imd)  # unsigned char (1 byte)
pa = packa(ima)
packed = pa.tolist()  # bytes would probably also do
```

it now takes only about 1.7 ms in total - very cool!
Down to 150 µs timed, or approx. 0.4 ms actual:
```cython
from cython cimport boundscheck, wraparound
from cpython cimport array
import array

@boundscheck(False)
@wraparound(False)
def pack(const unsigned char[::1] di):
    cdef:
        unsigned int i, n = len(di)
        unsigned char h, l, r
        array.array do = array.array('B')
    array.resize(do, n>>1)
    for i in range(0, n, 2):
        h = di[i] & 0xF0
        l = di[i+1] >> 4
        r = h | l
        do.data.as_uchars[i>>1] = r
    return do
```
I'm not converting the result array to a list anymore - this is done automatically by py-spidev when writing - and the total time is about the same: 10 ms (@ 10 MHz).
If you want to be as fast as C, you should not use a list with Python integers inside, but an array.array. It is possible to get a speed-up of around 140 for your Python+list code by using Cython + array.array.

Here are some ideas how to make your code faster with Cython. As benchmark I choose a list with 1000 elements (big enough, and cache misses have no effect yet).

As baseline, your Python implementation with a list:
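The answer's snippets did not survive extraction; a reconstruction of what such a baseline might look like, using % as the following remark explains (names are mine):

```python
def pack_list(it):
    # pure Python baseline: pack the low nibbles of each byte pair
    n = len(it) // 2
    return [(it[i*2] % 16) << 4 | it[i*2 + 1] % 16 for i in range(n)]
```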
By the way, I use % instead of //, which is probably what you want; otherwise you would get only 0s as result (only the lower bits have data in your description).

After cythonizing the same function (with the %%cython magic) we get a speed-up of around 2.

Let's look at the HTML produced by option -a; for the line corresponding to the for-loop we see that PyNumber_Multiply is used - slow Python multiplication - and __Pyx_DECREF: all temporaries are slow Python objects. We need to change that!

Let's pass not a list but an array.array of bytes to our function, and return an array.array of bytes back. Lists have full-fledged Python objects inside, while array.array holds the lowly raw C data, which is faster.

Better, but when we take a look at the generated HTML, there is still some slow Python code: we still use a Python setter for the array (__Pyx_SetItemInt), and for this a Python object __pyx_t_2 is needed; to avoid this we use array.data.as_chars:
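The code for this step was lost in extraction; a plausible sketch of the variant that writes through the raw buffer (my reconstruction, not the answer's exact code; I use as_uchars here since the values are unsigned bytes):

```cython
%%cython
from cpython cimport array
import array

def pack_array(array.array a):
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [0]*n)
    for i in range(n):
        # write through the raw C buffer - no Python object is created
        res.data.as_uchars[i] = (a.data.as_uchars[i*2] % 16) << 4 | a.data.as_uchars[i*2+1] % 16
    return res
```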
Much better, but let's take a look at the HTML again: we see some calls to __Pyx_RaiseBufferIndexError - this safety costs some time, so let's switch it off (boundscheck=False, wraparound=False):

When we look at the generated HTML now, we see: no Python stuff! Good so far. However, I'm not sure about __Pyx_mod_long; its definition adjusts the result of C's % so that it gets the sign of the divisor, as Python requires.

So C and Python differ for the mod of negative numbers, and that must be taken into account. This function definition, albeit inlined, will prevent the C compiler from optimizing a%16 as a&15. We have only positive numbers, so there is no need to care about negative ones; thus we need to do the a&15 trick ourselves.

I'm also satisfied with the resulting C code/HTML - it is only one line.
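A quick pure-Python sanity check of the mod reasoning above (my addition):

```python
# Python's % takes the sign of the divisor; C's % takes the sign of the dividend.
assert -7 % 16 == 9  # in C, -7 % 16 would be -7

# For non-negative values the &-trick matches %, so the rewrite is safe:
for a in range(256):
    assert a % 16 == a & 15
```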
Conclusion: In sum, that means a speed-up of 140 (140 µs vs. 1.02 µs) - not bad! Another interesting point: the calculation itself takes about 2 µs (and that comprises less-than-optimal bounds checking and division) - 138 µs are spent creating, registering and deleting temporary Python objects.
If you need the upper bits, and can assume that the lower bits are without dirt (otherwise &240 can help), you can use ( it[i*2] & 0xF0 ) | (it[i*2+1] >> 4) instead.

Another interesting question is what the operations cost when a list is used. If we start with the "improved" version:
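The snippet for the "improved" list version was lost in extraction; a plausible reconstruction with fewer integer operations per pair (assuming, as above, that the data sits in the high nibbles):

```python
def pack_improved(it):
    # one OR and one shift per pair, instead of two divisions and a shift
    n = len(it) // 2
    return [it[i*2] | (it[i*2+1] >> 4) for i in range(n)]
```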
we see that reducing the number of integer operations leads to a big speed-up. That is due to the fact that Python integers are immutable and every operation creates a new temporary object, which is costly. Eliminating operations also means eliminating costly temporaries.
However, it[i*2] | (it[i*2+1]>>4) is still done with Python integers; as the next step we make these cdef operations.

I don't know how this could be improved further; thus we have 7.3 µs for the list version vs. 1 µs for array.array.

Last question: what is the cost breakdown of the list version? In order to avoid it being optimized away by the C compiler, we use a slightly different baseline function:
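A hypothetical reconstruction of such a baseline: it folds every packed value into s, so the computation cannot be optimized away:

```python
def pack_sum(it):
    # consume each packed byte in s instead of storing it in a list
    s = 0
    n = len(it) // 2
    for i in range(n):
        s += it[i*2] | (it[i*2+1] >> 4)
    return s
```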
The usage of the s variable means that it does not get optimized away in the second version.

About 2 µs, or about 30%, are the costs of creating new integer objects. What are the costs of the memory allocation?
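Presumably the allocation cost was measured with a variant that still creates and fills the result list but skips the packing arithmetic; a hypothetical sketch:

```python
def pack_alloc(it):
    # pay for list allocation and element stores only
    n = len(it) // 2
    r = [0] * n
    for i in range(n):
        r[i] = 0  # small ints are cached, so no new object is created
    return r
```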
That leads to the following performance breakdown of the list version:
I must confess, I expected "create ints" to play a bigger role, and didn't think that accessing the data in the list and casting it to chars would cost that much.