Vectorized string operations in Numpy: why are the

2019-02-14 15:50发布

问题:

This is of those "mostly asked out of pure curiosity (in possibly futile hope I will learn something)" questions.

I was investigating ways of saving memory on operations on massive numbers of strings, and for some scenarios it seems like string operations in numpy could be useful. However, I got somewhat surprising results:

import random
import string

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))

Memory consumption using memory_profiler:

%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB

%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB

So far, so good; however, timing results are surprising for me:

%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is that? I expected that since Numpy uses contiguous chunks of memory for its arrays AND its operations are vectorized (as the above numpy doc page says) AND numpy string arrays apparently use less memory so operating on them should at least potentially be more on-CPU cache-friendly, performance on arrays of strings would be at least similar to those in pure Python?

Environment:

Python 3.6.3 x64, Linux

numpy==1.14.1

回答1:

Vectorized is used in two ways when talking about numpy, and it`s not always clear which is meant.

  1. Operations that operate on all elements of an array
  2. Operations that call optimized (and in many cases multi-threaded) numerical code internally

The second point is what makes vectorized operations much faster than a for loop in python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they're referring to the second case as well. However, in the numpy documentation, vectorized only refers to the first case. It means you can use a function directly on an array, without having to loop through all the elements and call it on each elements. In this sense it makes code more concise, but isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operatios since I think it is more readable than list comprehensions, even if performance is identical.

Now, for the code in question the documentation for np.char (which is an alias for np.core.defchararray), states

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since certainly each way will have different advantages and disadvantages. Using arrays defined as follows:

npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)

This creates arrays (or chararrays for the last two) with the following datatypes:

In [68]: npob.dtype
Out[68]: dtype('O')

In [69]: npuni.dtype
Out[69]: dtype('<U10')

In [70]: npstr.dtype
Out[70]: dtype('S10')

In [71]: npchar.dtype
Out[71]: dtype('S10')

In [72]: npcharU.dtype
Out[72]: dtype('<U10')

The benchmarks give quite a range of performance across these datatypes:

%timeit [x.upper() for x in test]
%timeit np.char.upper(test)

# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>

# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is included performance becomes much worse. The chararray is by far the slowest, wether handling unicode or not. It should be clear why it's not recommended for use.

Using unicode strings has a significant performance penalty. The docs state the following for differences between these types

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.

In this case, where the character set does not require unicode it would make sense to use the faster string_ type. If unicode was needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is needed. Another good example of when a list may be better is appending lots of data

So, takeaways from this:

  1. Python, while generally accepted as slow, is very performant for some common things. Numpy is generally quite fast, but is not optimized for everything.
  2. Read the docs. If there's more than one way of doing things (and there usually is), odds are one way is better for what you're trying to do.
  3. Don't blindly assume that vectorized code will be faster - Always profile when you care about performance (this goes for any "optimization" tips).