I'm new to Cython, and I've been having a recurring problem involving encoding unicode inside a numpy array.
Here's an example of the problem:
import numpy as np
cimport numpy as np
cpdef pass_array(np.ndarray[ndim=1, dtype=np.unicode] a):
    pass

cpdef access_unicode_item(np.ndarray a):
    cdef unicode item = a[0]
Example errors:
In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [4]: pass_array(unicode_array)
ValueError: Does not understand character buffer dtype format string ('w')
In [5]: access_unicode_item(unicode_array)
TypeError: Expected unicode, got numpy.unicode_
The problem seems to be that the values are not real unicode, but numpy.unicode_. Is there a way to encode the values in the array as proper unicode, so that I can type individual items for use in Cython code?
In Py2.7
In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [376]: arr
Out[376]:
array([u'array', u'of', u'unicode'],
      dtype='<U7')
In [377]: arr.dtype
Out[377]: dtype('<U7')
In [378]: type(arr[0])
Out[378]: numpy.unicode_
In [379]: type(arr[0].item())
Out[379]: unicode
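The same distinction carries over to Python 3 and current NumPy, where np.unicode_ is spelled np.str_ and the builtin is str; a minimal sketch of the indexing vs .item() behavior:

```python
import numpy as np

arr = np.array([u"array", u"of", u"unicode"])

# Indexing returns a numpy scalar subclass (np.str_, formerly
# np.unicode_), while .item() unwraps it to the builtin type.
print(type(arr[0]))
print(type(arr[0].item()))

# Because np.str_ subclasses str, isinstance checks still pass:
print(isinstance(arr[0], str))
```

So .item() is the explicit way to get a builtin unicode/str object out of the array.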
In general, x[0] returns an element of x as a numpy subclass instance; in this case np.unicode_ is a subclass of unicode.
In [384]: isinstance(arr[0],np.unicode_)
Out[384]: True
In [385]: isinstance(arr[0],unicode)
Out[385]: True
I think you'd encounter the same sort of issues between np.int32 and int, but I haven't worked enough with Cython to be sure.
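A quick check of that analogy (Python 3 behavior; note that here the numpy scalar is not a subclass of the builtin, unlike the unicode case):

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int32)

x = a[0]         # np.int32 scalar
y = a[0].item()  # builtin Python int

# Unlike np.str_/str, np.int32 does NOT subclass int on Python 3,
# so typed code would again want .item() (or a C integer type).
print(isinstance(x, int), type(y) is int)
```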
Where have you seen Cython code that specifies a string (unicode or byte) dtype?
http://docs.cython.org/src/tutorial/numpy.html has expressions like
# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
....
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):
The purpose of the [] part is to improve indexing efficiency.
What we need to do then is to type the contents of the ndarray objects. We do this with a special “buffer” syntax which must be told the datatype (first argument) and number of dimensions (“ndim” keyword-only argument, if not provided then one-dimensional is assumed).
I don't think np.unicode will help because it doesn't specify the character length. The full string dtype has to include the number of characters, e.g. <U7 in my example.
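This is easy to verify from the dtype itself; a short sketch (plain NumPy, any recent version):

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# The dtype carries the character count: <U7 means up to 7
# little-endian UCS-4 (4-byte) code points per element.
print(arr.dtype)           # <U7
print(arr.dtype.itemsize)  # 28 = 7 chars * 4 bytes

# A length-free unicode dtype has itemsize 0 until data fixes it:
print(np.dtype(str).itemsize)
```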
We need to find working examples which pass string arrays - either in the cython documentation or other SO cython questions.
For some operations, you could treat the unicode array as an array of int32.
In [397]: arr.nbytes
Out[397]: 84
3 strings × 7 chars/string × 4 bytes/char = 84 bytes
In [398]: arr.view(np.int32).reshape(-1,7)
Out[398]:
array([[ 97, 114, 114,  97, 121,   0,   0],
       [111, 102,   0,   0,   0,   0,   0],
       [117, 110, 105,  99, 111, 100, 101]])
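Building on that view, here is a sketch of one such operation: uppercasing ASCII letters by arithmetic on the underlying code points (assumes a little-endian platform, which matches the <U7 dtype above):

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])   # dtype <U7

# Each element is 7 UCS-4 code points, zero-padded on the right.
codes = arr.view(np.int32).reshape(len(arr), -1)

# ASCII lowercase letters are code points 97-122; subtracting 32
# uppercases them while leaving the zero padding untouched.
mask = (codes >= 97) & (codes <= 122)
upper = np.where(mask, codes - 32, codes)

# View the int32 rows back as <U7 strings (trailing NULs are trimmed).
result = upper.astype(np.int32).view(arr.dtype).ravel()
print(result)  # ['ARRAY' 'OF' 'UNICODE']
```

Anything expressible as elementwise integer arithmetic like this stays entirely inside compiled NumPy code.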
Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.