Converting numpy arrays of code points to and from strings

Posted 2020-07-26 18:40

Question:

I have a long unicode string:

import random
import re

alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)

I would like to view it as a series of code points, so at the moment, I am doing the following:

arr = np.array(list(mystr), dtype='U1')

I would like to manipulate the string as numbers and eventually get a different set of code points back. To invert the transformation, I currently do:

mystr = ''.join(arr.tolist())

These transformations are reasonably fast and invertible, but take up an unnecessary amount of space with the list intermediary.

Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?

Afterthoughts

I can get arr to appear as a single string with something like

buf = arr.view(dtype='U' + str(arr.size))

This results in a 1-element array containing the entire original string. The inverse is possible as well:

buf.view(dtype='U1')

The only issue is that the string extracted from buf (for example, buf[0]) has type np.str_, not str.

Answer 1:

fromiter works, but is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:

In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))

In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True

I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Also, calling bytearray(x, codec) instead of x.encode(codec) yields a mutable bytearray rather than an immutable bytes object, so the resulting array is writable.
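
As a rough sketch of how this could be packaged, the helper below wraps the frombuffer call. The name str_to_codepoints is hypothetical, and the uint32 view used to treat the characters as numbers is an added illustration rather than part of the timings above; it assumes each 'U1' element is stored as a 4-byte code point in native byte order.

```python
import sys

import numpy as np

# UTF-32 in the machine's native byte order matches NumPy's default 'U' layout.
CODEC = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'


def str_to_codepoints(s):
    """Return a writable 'U1' array holding one character of s per element."""
    # bytearray(s, CODEC) encodes into a mutable buffer, so the array is writable.
    return np.frombuffer(bytearray(s, CODEC), dtype='U1')


arr = str_to_codepoints('abc')
codes = arr.view(np.uint32)   # same memory, seen as integer code points
codes += 1                    # shift every code point by one, in place
shifted = arr.view(f'U{arr.size}').item()
```

Because codes and arr share memory, the in-place addition changes the characters seen through arr as well; no copy is made at any step after the initial encoding.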


As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:

In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')

In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_

In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str

In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
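
The reverse conversion could be wrapped in a similar way. This is only a sketch: codepoints_to_str is a hypothetical name, it assumes a 1-D contiguous 'U1' array, and the empty-array guard is an added precaution rather than something measured above.

```python
import numpy as np


def codepoints_to_str(arr):
    """Join a 1-D, contiguous 'U1' array into a plain Python str."""
    if arr.size == 0:
        return ''  # assumption: sidestep building a zero-width 'U0' view
    # Reinterpret the whole buffer as one string of arr.size characters,
    # then pull out the single element as a built-in str.
    return arr.view(f'U{arr.size}').item()


chars = np.array(list('héllo'), dtype='U1')
result = codepoints_to_str(chars)
empty = codepoints_to_str(np.array([], dtype='U1'))
```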

Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. It strips trailing nulls when reading an element back, so it can't distinguish between 'asdf\x00\x00\x00' and 'asdf'. Using NumPy arrays for string operations is therefore not safe if your data may contain null code points.
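
A small illustration of the null caveat, using the same two strings. The variable names are made up for the example.

```python
import numpy as np

padded = np.array(['asdf\x00\x00\x00'])
plain = np.array(['asdf'])

# Both elements read back as the same four-character string.
same = padded[0] == plain[0]
# The three trailing nulls are gone once the element is extracted.
recovered_len = len(padded.item())
```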



Answer 2:

The fastest way I have found to convert a string to an array is

arr = np.array([mystr]).view(dtype='U1')

Another (slower) way to convert a string to an array of unicode code points based on @Daniel Mesejo's comment:

arr = np.fromiter(mystr, dtype='U1', count=len(mystr))

Looking at the source code for fromiter shows that setting the count parameter to the length of the string will cause the entire array to be allocated at once, instead of performing multiple reallocations.
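
To make the role of count concrete, here is a short sketch comparing the two call styles. Both produce the same array; the difference lies only in how the output is allocated. The sample string is arbitrary.

```python
import numpy as np

mystr = 'code points'

# With count, the output array is allocated once at its final size.
sized = np.fromiter(mystr, dtype='U1', count=len(mystr))
# Without count (default -1), NumPy has to grow the array as it iterates.
grown = np.fromiter(mystr, dtype='U1')
```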

To convert back to a string:

str(arr.view(dtype=f'U{arr.size}')[0])

For most purposes, the final conversion to Python str is not necessary, since np.str_ is a subclass of str, so the following is usually enough:

arr.view(dtype=f'U{arr.size}')[0]
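
The subclass relationship can be checked directly. A brief sketch with an arbitrary sample string:

```python
import numpy as np

arr = np.array(list('numpy'), dtype='U1')
elem = arr.view(f'U{arr.size}')[0]

is_str_instance = isinstance(elem, str)   # np.str_ subclasses str
is_exact_str = type(elem) is str          # but it is not the built-in type itself
plain = str(elem)                         # explicit conversion to built-in str
```

Code that only relies on isinstance checks or ordinary string methods should accept elem as is; the explicit str() call matters mainly where the exact type is inspected.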

Appendix: Timing of frombuffer vs array

String length 100:

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(100))

%timeit np.array([mystr]).view(dtype='U1')
1.43 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
1.2 µs ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

String length 10000:

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(10000))

%timeit np.array([mystr]).view(dtype='U1')
4.33 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
10.9 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

String length 1000000:

mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(1000000))

%timeit np.array([mystr]).view(dtype='U1')
672 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
732 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)