I have a long unicode string:
import random
import re
import numpy as np
alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)  # raw string avoids an invalid-escape warning
I would like to view it as a series of code points, so at the moment, I am doing the following:
arr = np.array(list(mystr), dtype='U1')
I would like to manipulate the string as numbers and eventually get some different code points back. To invert the transformation, I do:
mystr = ''.join(arr.tolist())
These transformations are reasonably fast and invertible, but the intermediate list takes up an unnecessary amount of space.
Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?
Afterthoughts
I can get arr to appear as a single string with something like
buf = arr.view(dtype='U' + str(arr.size))
This results in a 1-element array containing the entire original. The inverse is possible as well:
buf.view(dtype='U1')
The only issue is that the type of the result is np.str_, not str.
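The round trip through view can be sketched as follows. This is a minimal illustration, not a benchmark; the sample string and the names back and as_str are made up for the example, and item() is used to pull out a plain str.

```python
import numpy as np

mystr = "héllo wörld"  # sample data, chosen only for illustration
arr = np.array(list(mystr), dtype='U1')

# Reinterpret the same memory as one string of arr.size characters.
buf = arr.view(dtype=f'U{arr.size}')

# Reinterpret it again as individual characters.
back = buf.view(dtype='U1')

# item() returns a built-in str rather than np.str_.
as_str = buf.item()
print(buf.shape, type(as_str).__name__)
```

Note that this relies on arr being a contiguous 1-D array; a non-empty string is assumed.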
fromiter works, but is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:
In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))
In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True
I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Also, using bytearray instead of encode gets a mutable bytearray instead of an immutable bytes object, so the resulting array is writable.
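The difference between the two buffer types can be checked through the array's writeable flag. A small sketch, assuming a short sample string; the names read_only and writable are just for this example.

```python
import sys
import numpy as np

text = "abc"  # sample data for illustration
codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

# bytes is immutable, so the array built on it is read-only.
read_only = np.frombuffer(text.encode(codec), dtype='U1')

# bytearray is mutable, so the array built on it can be modified in place.
writable = np.frombuffer(bytearray(text, codec), dtype='U1')
writable[0] = 'z'

print(read_only.flags.writeable, writable.flags.writeable)
print(writable.view(dtype=f'U{writable.size}').item())
```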
As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:
In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')
In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_
In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str
In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. NumPy can't distinguish between 'asdf\x00\x00\x00' and 'asdf', so using NumPy arrays for string operations is not safe if your data may contain null code points.
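The null caveat is easy to reproduce. A minimal sketch, using the two strings from the sentence above; the variable names are illustrative.

```python
import numpy as np

padded = 'asdf\x00\x00\x00'
plain = 'asdf'

# Store each string in a one-element array and read it back as a Python str.
padded_back = np.array([padded]).item()
plain_back = np.array([plain]).item()

# The trailing nulls are lost on the way back out of NumPy.
print(len(padded), len(padded_back))
print(padded_back == plain_back)
```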
The fastest way I have found to convert a string to an array is
arr = np.array([mystr]).view(dtype='U1')
Another (slower) way to convert a string to an array of unicode code points based on @Daniel Mesejo's comment:
arr = np.fromiter(mystr, dtype='U1', count=len(mystr))
Looking at the source code for fromiter shows that setting the count parameter to the length of the string will cause the entire array to be allocated at once, instead of performing multiple reallocations.
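As a quick sanity check, the fromiter version should produce the same array as the list-based conversion from the question. This sketch assumes a short sample string and only compares results; it says nothing about speed.

```python
import numpy as np

text = "héllo"  # sample data for illustration

# count tells fromiter how many items to expect up front.
via_iter = np.fromiter(text, dtype='U1', count=len(text))

# The list-based conversion from the question, for comparison.
via_list = np.array(list(text), dtype='U1')

print(np.array_equal(via_iter, via_list))
```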
To convert back to a string:
str(arr.view(dtype=f'U{arr.size}')[0])
For most purposes, the final conversion to Python str is not necessary since np.str_ is a subclass of str, so the following is enough:
arr.view(dtype=f'U{arr.size}')[0]
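Since the question asks about manipulating the string as numbers, here is one possible way to do it: a 'U1' element occupies four bytes, so the character array can be viewed as unsigned 32-bit integers that share the same memory. This is a sketch under that assumption, with a made-up sample string and a made-up transformation (shifting each code point by one).

```python
import numpy as np

mystr = "abc"  # sample data for illustration
arr = np.array([mystr]).view(dtype='U1')

# View the same memory as integer code points (no copy is made).
codes = arr.view(np.uint32)
print(codes.tolist())

# Modify the code points in place; arr sees the change too.
codes += 1

# Convert back to an ordinary Python string.
result = arr.view(dtype=f'U{arr.size}').item()
print(result)
```

Keep the null caveat in mind here: arithmetic that produces code point 0 at the end of the array would be dropped when converting back.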
Appendix: Timing of frombuffer vs array
100
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(100))
%timeit np.array([mystr]).view(dtype='U1')
1.43 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
1.2 µs ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
10000
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(10000))
%timeit np.array([mystr]).view(dtype='U1')
4.33 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
10.9 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1000000
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(1000000))
%timeit np.array([mystr]).view(dtype='U1')
672 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
732 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)