I have a function that I'd like to use Cython with that involves processing large numbers of fixed-length strings. For a standard Cython function, I can declare the types of arrays like so:
import numpy as np

cpdef double[:] g(double[:] in_arr):
    cdef double[:] out_arr = np.zeros(in_arr.shape, dtype='float64')
    cdef Py_ssize_t i  # typed loop index
    for i in range(len(in_arr)):
        out_arr[i] = in_arr[i]
    return out_arr
This compiles and works as expected when the dtype is something simple like int32, float, double, etc. However, I cannot figure out how to create a typed memoryview of fixed-length strings, i.e. the equivalent of np.dtype('a5'), for example.
If I use this:
cpdef str[:] f(str[:] in_arr):
    # in_arr should be a numpy array of 5-character strings
    cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')
    cdef i
    for i in range(len(in_arr)):
        out_arr[i] = in_arr[i]
    return out_arr
The function compiles, but this:
in_arr = np.array(['12345', '67890', '22343'], dtype='a5')
f(in_arr)
Throws the following error:
---> 16 cpdef str[:] f(str[:] in_arr):
17 # arr should be a numpy array of 5-character strings
18 cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')
ValueError: Buffer dtype mismatch, expected 'unicode object' but got a string
Similarly, if I use bytes[:], it gives the error "Buffer dtype mismatch, expected 'bytes object' but got a string", and this doesn't even get to the fact that nowhere am I specifying that these strings have length 5.
Interestingly, I can include fixed-length strings in a structured type as in this question, but I don't think that's the right way to declare the types.
In a Python3 session, your a5 array contains bytestrings.
In [165]: np.array(['12345', '67890', '22343'], dtype='a5')
Out[165]:
array([b'12345', b'67890', b'22343'],
dtype='|S5')
http://cython.readthedocs.io/en/latest/src/tutorial/strings.html says that str is the unicode string type when compiled with Python3.
I suspect that np.array(['12345', '67890', '22343'], dtype='U5') would be accepted as the input array for your function. But copying to the a5 out_arr would have problems.
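To make the bytes versus unicode distinction concrete, here is a small NumPy-only sketch (no Cython involved). It stores the same values as 'S5' and as 'U5' and compares what each array holds; the variable names are just for illustration.

```python
import numpy as np

# Same three values stored as fixed-width bytes ('S5') and as unicode ('U5').
as_bytes = np.array(['12345', '67890', '22343'], dtype='S5')
as_text = np.array(['12345', '67890', '22343'], dtype='U5')

# 'S5' uses 1 byte per character, 'U5' uses 4 bytes per character.
print(as_bytes.dtype, as_bytes.itemsize)
print(as_text.dtype, as_text.itemsize)

# Elements compare equal to Python bytes and str respectively.
print(as_bytes[0] == b'12345', as_text[0] == '12345')
```

Neither array has dtype=object, which is the dtype the object version below relies on.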
object version
An object version of this loop works:
cpdef str[:] objcopy(str[:] in_arr):
    cdef str[:] out_arr = np.zeros(in_arr.shape[0], dtype=object)
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr
narr = np.array(['one','two','three'], dtype=object)
cpy = objcopy(narr)
print(cpy)
print(np.array(cpy))
print(np.array(objcopy(np.array([None,'one', 23.4]))))
These functions return a memoryview, which has to be converted to array to print.
single char version
Single byte memoryview copy:
cpdef char[:] chrcopy(char[:] in_arr):
    cdef char[:] out_arr = np.zeros(in_arr.shape[0], dtype='uint8')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr
print(np.array(chrcopy(np.array([b'one',b'two',b'three']).view('S1'))).view('S5'))
This uses view to convert the strings to single bytes and back.
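The view round trip can be checked in plain NumPy, without the Cython function. This is only a sketch of the reinterpretation step; the names words, single and restored are made up for the example.

```python
import numpy as np

words = np.array([b'one', b'two', b'three'])  # dtype is inferred as S5
single = words.view('S1')                     # 3 * 5 = 15 one-byte elements
restored = single.view('S5')                  # regroup every 5 bytes

print(words.dtype, single.shape)
print(single[:5])   # bytes of 'one' followed by two NUL pad bytes
print(restored)
```

Shorter strings are padded with NUL bytes up to the fixed width, so the byte count is always a multiple of 5 and the view back to S5 lines up.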
2d unicode version
I looked into this issue last year: Cython: storing unicode in numpy array
This processes unicode strings as though they were rows of a 2d int array; reshape is needed before and after.
cpdef int[:,:] int2dcopy(int[:,:] in_arr):
    # C int is typically 4 bytes, matching one character of a 'U' dtype, so use int32
    cdef int[:,:] out_arr = np.zeros((in_arr.shape[0], in_arr.shape[1]), dtype='int32')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i,:] = in_arr[i,:]
    return out_arr
narr = np.array(['one','two','three', 'four', 'five'], dtype='U5')
cpy = int2dcopy(narr.view('int32').reshape(-1,5))
print(cpy)
print(np.array(cpy))
print(np.array(cpy).view(narr.dtype)) # shape (5, 1); add .reshape(-1) to flatten
For bytestrings, a similar 2d char version should work.
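As a rough sketch of the NumPy side of such a bytestring version: each S5 element can be viewed as a row of 5 bytes, which is the shape a 2d char memoryview would receive. The Cython function itself is not shown here; rows.copy() is only a stand-in for the typed copy loop, and the choice of uint8 for the view is an assumption.

```python
import numpy as np

narr = np.array([b'one', b'two', b'three', b'four', b'five'], dtype='S5')

# Reinterpret each 5-byte string as one row of 5 unsigned bytes.
rows = narr.view(np.uint8).reshape(-1, 5)
print(rows.shape)
print(rows[0])        # byte values of 'one' plus two NUL pad bytes

# Stand-in for the typed 2d copy; then view the bytes as S5 again.
copied = rows.copy()
back = copied.view('S5').reshape(-1)
print(back)
```

The reshape before and the reshape(-1) after mirror what the unicode version above does with its int rows.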
c struct version
byte5 = cython.struct(x=cython.char[5])
cpdef byte5[:] byte5copy(byte5[:] in_arr):
    cdef byte5[:] out_arr = np.zeros(in_arr.shape[0], dtype='|S5')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr
narr = np.array(['one','four','six'], dtype='|S5')
cpy = byte5copy(narr)
print(cpy)
print(repr(np.array(cpy)))
# array([b'one', b'four', b'six'], dtype='|S5')
The C struct creates a memoryview with 5-byte elements, which map onto the S5 elements of the array.
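The same size correspondence can be seen from the NumPy side with a one-field record dtype. This is only an illustration of the layout: rec_dtype is a hypothetical dtype chosen to mirror a struct holding a 5-char field named x, and no Cython is involved.

```python
import numpy as np

narr = np.array([b'one', b'four', b'six'], dtype='S5')

# Hypothetical record dtype with a single 5-byte field, like a struct {char x[5]}.
rec_dtype = np.dtype([('x', 'S5')])

# Same itemsize (5 bytes), so the array can be reinterpreted element for element.
recs = narr.view(rec_dtype)
print(rec_dtype.itemsize, recs.shape)
print(recs['x'])
```

Because both element types occupy exactly 5 bytes, the view keeps the same number of elements.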
https://github.com/cython/cython/blob/master/tests/memoryview/numpy_memoryview.pyx also has a structured array example with bytestrings.