Cython specify numpy array of fixed length strings

Posted 2020-07-20 04:18

Question:

I have a function I'd like to speed up with Cython; it processes large numbers of fixed-length strings. For a standard Cython function, I can declare the types of arrays like so:

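import numpy as np

cpdef double[:] g(double[:] in_arr):
    # index shape explicitly: a memoryview's .shape may include unused trailing dimensions
    cdef double[:] out_arr = np.zeros(in_arr.shape[0], dtype='float64')

    cdef Py_ssize_t i  # typed index keeps the loop in C
    for i in range(in_arr.shape[0]):
        out_arr[i] = in_arr[i]

    return out_arr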

This compiles and works as expected when the dtype is something simple like int32, float, double, etc. However, I cannot figure out how to create a typed memoryview of fixed-length strings - i.e. the equivalent of np.dtype('a5'), for example.

If I use this:

cpdef str[:] f(str[:] in_arr):
    # arr should be a numpy array of 5-character strings
    cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')

    cdef Py_ssize_t i
    for i in range(len(in_arr)):
        out_arr[i] = in_arr[i]

    return out_arr

The function compiles, but this:

in_arr = np.array(['12345', '67890', '22343'], dtype='a5')
f(in_arr)

Throws the following error:

---> 16 cpdef str[:] f(str[:] in_arr):
     17     # arr should be a numpy array of 5-character strings
     18     cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')

ValueError: Buffer dtype mismatch, expected 'unicode object' but got a string

Similarly, if I use bytes[:], it gives the error "Buffer dtype mismatch, expected 'bytes object' but got a string". And this doesn't even address the fact that nowhere am I specifying that these strings have length 5.

Interestingly, I can include fixed-length strings in a structured type as in this question, but I don't think that's the right way to declare the types.

Answer 1:

In a Python3 session, your a5 array contains bytestrings.

In [165]: np.array(['12345', '67890', '22343'], dtype='a5')
Out[165]: 
array([b'12345', b'67890', b'22343'], 
      dtype='|S5')

http://cython.readthedocs.io/en/latest/src/tutorial/strings.html says that str is the unicode string type when the module is compiled under Python 3, which is why the str[:] memoryview rejects the bytestring array.

I suspect that np.array(['12345', '67890', '22343'], dtype='U5') would be accepted as the input array for your function. But copying to the a5 out_arr would have problems.
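To make the bytes/unicode distinction concrete, here is a small plain-NumPy sketch (not Cython; the variable names are mine) comparing the two dtypes. It assumes NumPy's usual layout of one byte per S character and four bytes per U character.

```python
import numpy as np

# 'S5' is the spelling NumPy reports for the question's 'a5' dtype
byte_arr = np.array(['12345', '67890', '22343'], dtype='S5')
uni_arr = np.array(['12345', '67890', '22343'], dtype='U5')

print(byte_arr.dtype.kind, byte_arr.dtype.itemsize)  # S 5
print(uni_arr.dtype.kind, uni_arr.dtype.itemsize)    # U 20

# elements come back as bytes and str subclasses respectively
print(isinstance(byte_arr[0], bytes), isinstance(uni_arr[0], str))  # True True
```

So a memoryview typed for one kind of element cannot be pointed at the other without some conversion or reinterpretation.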

object version

An object version of this loop works:

cpdef str[:] objcopy(str[:] in_arr):
    cdef str[:] out_arr = np.zeros(in_arr.shape[0], dtype=object)
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr

narr = np.array(['one','two','three'], dtype=object)
cpy = objcopy(narr)
print(cpy)
print(np.array(cpy))
print(np.array(objcopy(np.array([None,'one', 23.4]))))

These functions return a memoryview, which has to be converted to an array (np.array(cpy)) before its contents can be printed.
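If the data starts out as a fixed-length string array, one way to get it into a form the object version accepts is a round trip through dtype=object. This is a plain-NumPy sketch with made-up variable names; it copies the strings into Python objects, so it gives up the compact fixed-width storage.

```python
import numpy as np

fixed = np.array(['one', 'two', 'three'], dtype='U5')

# astype(object) builds an array of references to ordinary Python str objects
as_obj = fixed.astype(object)
print(as_obj.dtype, type(as_obj[0]))

# ... pass as_obj to an object-typed function such as objcopy here ...

# convert back to fixed-width storage afterwards
restored = as_obj.astype('U5')
print(restored)
```

The conversion cost in both directions is the price for letting Cython treat each element as a generic Python object.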

single char version

Single byte memoryview copy:

cpdef char[:] chrcopy(char[:] in_arr):
    cdef char[:] out_arr = np.zeros(in_arr.shape[0], dtype='uint8')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr

print(np.array(chrcopy(np.array([b'one',b'two',b'three']).view('S1'))).view('S5'))

Uses view to convert strings to single bytes and back.
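The view round trip on its own, outside Cython, looks roughly like this (plain NumPy; variable names are illustrative). No data is copied; only the dtype used to interpret the same buffer changes.

```python
import numpy as np

words = np.array([b'one', b'two', b'three'])   # inferred dtype is S5
single = words.view('S1')                       # 15 one-byte elements
print(single.shape)                             # (15,)
print(single[:5])                               # the bytes of 'one' plus two padding bytes

restored = single.view('S5')                    # reinterpret as 5-byte strings again
print(restored)
```

Because the strings are padded to five bytes, the flat byte array always has a length that is a multiple of five, which is what makes the final view back to S5 valid.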

2d unicode version

I looked into this issue last year in "Cython: storing unicode in numpy array".

This processes unicode strings as though they were rows of a 2d int array; reshape is needed before and after.

cpdef int[:,:] int2dcopy(int[:,:] in_arr):
    # np.intc matches C int; plain dtype=int is 64-bit on many platforms
    cdef int[:,:] out_arr = np.zeros((in_arr.shape[0], in_arr.shape[1]), dtype=np.intc)
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i,:] = in_arr[i,:]
    return out_arr

narr = np.array(['one','two','three', 'four', 'five'], dtype='U5')
# each U5 element is 5 four-byte code points, so view it as 5 C ints per row
cpy = int2dcopy(narr.view(np.intc).reshape(-1,5))
print(cpy)
print(np.array(cpy))
print(np.array(cpy).view(narr.dtype)) # shape (5,1); add .reshape(-1) to flatten

For bytestrings a similar 2d char version should work.
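A sketch of the array-side preparation for such a 2d byte version, in plain NumPy: the S5 array is viewed as rows of five unsigned bytes, which is the shape a hypothetical Cython function taking unsigned char[:, :] (not shown in this answer) could accept. Names are illustrative.

```python
import numpy as np

narr = np.array(['one', 'two', 'three', 'four', 'five'], dtype='S5')

# one row per string, one column per byte; short strings are zero-padded
rows = narr.view(np.uint8).reshape(-1, 5)
print(rows.shape)   # (5, 5)
print(rows[0])      # bytes of b'one' followed by two zeros

# after processing, reinterpret each row as an S5 string and flatten
restored = rows.view('S5').reshape(-1)
print(restored)
```

Compared with the unicode case, each character is one byte instead of four, so the row width equals the string length directly.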

c struct version

byte5 = cython.struct(x=cython.char[5])
cpdef byte5[:] byte5copy(byte5[:] in_arr):
    cdef byte5[:] out_arr = np.zeros(in_arr.shape[0], dtype='|S5')
    cdef int N
    N = in_arr.shape[0]
    for i in range(N):
        out_arr[i] = in_arr[i]
    return out_arr

narr = np.array(['one','four','six'], dtype='|S5')
cpy = byte5copy(narr)
print(cpy)
print(repr(np.array(cpy)))
# array([b'one', b'four', b'six'], dtype='|S5')

The C struct creates a memoryview with 5-byte elements, which map onto the array's S5 elements.
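The same mapping can be seen from the NumPy side with a structured dtype holding a single 5-byte field. This is only an illustration in plain NumPy; the field name x mirrors the struct above, and I have not checked whether a given Cython version treats the structured view and the plain S5 array interchangeably.

```python
import numpy as np

narr = np.array(['one', 'four', 'six'], dtype='S5')

# a record with one 5-byte string field occupies the same 5 bytes as an S5 element
rec_dtype = np.dtype([('x', 'S5')])
recs = narr.view(rec_dtype)

print(rec_dtype.itemsize)   # 5
print(recs['x'])            # the original strings, read through the field
```

Since both layouts are five bytes per element with no padding, the view is a reinterpretation of the same buffer and not a copy.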

https://github.com/cython/cython/blob/master/tests/memoryview/numpy_memoryview.pyx also has a structured array example with bytestrings.