I'm new to Cython, and I've been having a recurring problem involving encoding unicode inside of a numpy array.
Here's an example of the problem:
import numpy as np
cimport numpy as np
cpdef pass_array(np.ndarray[ndim=1, dtype=np.unicode] a):
    pass

cpdef access_unicode_item(np.ndarray a):
    cdef unicode item = a[0]
Example errors:
In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [4]: pass_array(unicode_array)
ValueError: Does not understand character buffer dtype format string ('w')
In [5]: access_unicode_item(unicode_array)
TypeError: Expected unicode, got numpy.unicode_
The problem seems to be that the values are not real unicode, but instead `numpy.unicode_`. Is there a way to encode the values in the array as proper unicode (so that I can type individual items for use in Cython code)?
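A minimal pure-Python reproduction of the setup (assuming a modern NumPy, where the scalar type `np.unicode_` is spelled `np.str_`) shows what the array actually holds:

```python
import numpy as np

# On modern NumPy the scalar type np.unicode_ is spelled np.str_;
# the array stores fixed-width unicode data, not Python objects.
unicode_array = np.array([u"array", u"of", u"unicode"])

print(unicode_array.dtype)    # <U7 -- fixed width of 7 characters
item = unicode_array[0]
print(type(item).__name__)    # str_ -- a numpy scalar, not a plain string
print(isinstance(item, str))  # True -- it is a *subclass* of the builtin
```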
In Py2.7, `x[0]` in general returns an element of `x` as an instance of a numpy subclass; in this case `np.unicode_`, which is a subclass of `unicode`. I think you'd encounter the same sort of issues between `np.int32` and `int`, but I haven't worked enough with cython to be sure.

Where have you seen cython code that specifies a string (unicode or byte) dtype? http://docs.cython.org/src/tutorial/numpy.html has expressions like `np.ndarray[DTYPE_t, ndim=2]`.
The purpose of the `[]` part is to improve indexing efficiency. I don't think `np.unicode` will help, because it doesn't specify a character length. The full string dtype has to include the number of characters, e.g. `<U7` in my example. We need to find working examples which pass string arrays, either in the cython documentation or in other SO cython questions.
For some operations, you could treat the unicode array as an array of `int32`: 3 strings x 7 chars/string x 4 bytes/char.
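That reinterpretation can be sketched with a view (assuming the usual 4-byte UCS-4 storage numpy uses for unicode arrays):

```python
import numpy as np

a = np.array([u"array", u"of", u"unicode"])  # dtype <U7: 28 bytes/item
# Each element is 7 code points * 4 bytes, so the same buffer can be
# reinterpreted as int32 code points: 3 strings x 7 ints each.
codes = a.view(np.int32).reshape(len(a), -1)
print(codes.shape)       # (3, 7)
print(codes[0])          # code points of "array"
print(chr(codes[0][0]))  # 'a'
print(codes[1])          # code points of "of", then zero padding
```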
Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.
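As a toy illustration of bypassing string methods (a hypothetical example reusing the same int32 view), a per-string character count becomes pure integer arithmetic of the kind cython can compile to a tight C loop:

```python
import numpy as np

a = np.array([u"array", u"of", u"unicode"])
codes = a.view(np.int32).reshape(len(a), -1)

# Count occurrences of 'a' in each string without touching any
# Python string method: just an integer comparison and a sum.
counts = (codes == ord("a")).sum(axis=1)
print(counts)   # [2 0 0]

# String lengths, likewise: non-zero code points per row.
lengths = (codes != 0).sum(axis=1)
print(lengths)  # [5 2 7]
```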