This is one of those "mostly asked out of pure curiosity (in the possibly futile hope I will learn something)" questions.
I was investigating ways of saving memory on operations on massive numbers of strings, and for some scenarios it seems like string operations in numpy could be useful. However, I got somewhat surprising results:
import random
import string
import numpy as np
milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]
npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))
Memory consumption using memory_profiler:
%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB
%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB
So far, so good; however, timing results are surprising for me:
%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is that? Since numpy uses contiguous chunks of memory for its arrays, its operations are vectorized (as the numpy doc page above says), and its string arrays apparently use less memory (so operating on them should at least potentially be more CPU-cache-friendly), I expected performance on arrays of strings to be at least similar to that of pure Python.
Environment:
Python 3.6.3 x64, Linux
numpy==1.14.1
Vectorized is used in two ways when talking about numpy, and it's not always clear which is meant:

1. A function that can be called on a whole array at once, instead of being applied element by element in a Python loop.
2. An operation whose inner loop runs in fast compiled code, possibly across multiple threads.

The second point is what makes vectorized operations much faster than a for loop in Python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they're referring to the second case as well. However, in the numpy documentation, vectorized only refers to the first case. It means you can use a function directly on an array, without having to loop through all the elements and call it on each element. In this sense it makes code more concise, but it isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I think it is more readable than list comprehensions, even if performance is identical.
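To make the documentation's sense of "vectorized" concrete, here is a minimal sketch (the small array and variable names are just for illustration); both calls produce the same result, and the first is "vectorized" only in that it is a single call on the whole array:

import numpy as np

a = np.array(['abc', 'def', 'ghi'])

# "Vectorized" in the numpy documentation sense: one call on the whole array.
upper_vectorized = np.char.upper(a)

# The explicit alternative: call str.upper on each element yourself.
upper_looped = np.array([s.upper() for s in a])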
Now, for the code in question: the documentation for np.char (which is an alias for np.core.defchararray) states that the chararray class exists mainly for backwards compatibility and that, starting from numpy 1.4, arrays of dtype object_, string_ or unicode_, together with the free functions in the numpy.char module, are recommended instead for fast vectorized string operations.

So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since each will certainly have different advantages and disadvantages. Using arrays defined as follows:
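Below is a minimal sketch of one way such arrays could be set up, reusing the random strings from the question (the variable names are illustrative, not from the original):

import random
import string
import numpy as np

strings = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

as_list      = strings                               # plain Python list of str
as_bytes     = np.array(strings, dtype=np.bytes_)    # fixed-width bytes dtype ('S')
as_unicode   = np.array(strings, dtype=np.unicode_)  # fixed-width unicode dtype ('U')
as_object    = np.array(strings, dtype=object)       # array of Python str objects
char_bytes   = np.char.array(as_bytes)               # chararray over the bytes data
char_unicode = np.char.array(as_unicode)             # chararray over the unicode data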
This creates arrays (or chararrays for the last two) with the following datatypes:
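With the sketch above, the resulting dtypes would look like this (fixed width 10 because the strings are 10 characters long):

for name, arr in [('bytes', as_bytes), ('unicode', as_unicode), ('object', as_object),
                  ('char_bytes', char_bytes), ('char_unicode', char_unicode)]:
    print(name, arr.dtype)
# bytes        |S10
# unicode      <U10
# object       object
# char_bytes   |S10
# char_unicode <U10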
The benchmarks give quite a range of performance across these datatypes:
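A timing harness along these lines (in IPython, applying the upper() operation from the question) can be used to compare them; the absolute numbers will vary with machine and numpy version:

%timeit [s.upper() for s in as_list]
%timeit np.char.upper(as_bytes)
%timeit np.char.upper(as_unicode)
%timeit [s.upper() for s in as_object]   # np.char functions generally expect fixed-width string dtypes, so loop over the object array
%timeit char_bytes.upper()
%timeit char_unicode.upper()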
Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is involved performance becomes much worse. The chararray is by far the slowest, whether handling unicode or not; it should be clear why it's not recommended for use.

Using unicode strings carries a significant performance penalty. The docs describe the differences between these types; the key one here is that the unicode dtype stores each character as four bytes (UCS-4), while the bytes (string_) dtype stores a single byte per character.
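A quick way to see that difference (the ten-character string here is just an example):

import numpy as np

b = np.array(['abcdefghij'], dtype=np.bytes_)    # one byte per character
u = np.array(['abcdefghij'], dtype=np.unicode_)  # four bytes per character (UCS-4)

print(b.dtype, b.itemsize)  # |S10 10
print(u.dtype, u.itemsize)  # <U10 40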
In this case, where the character set does not require unicode, it would make sense to use the faster string_ type. If unicode is needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is needed. Another good example of when a list may be better is appending lots of data (see the sketch after the takeaways below).

So, takeaways from this:

- A plain list of strings is still the fastest for this kind of elementwise string operation.
- Numpy is competitive with the string_ and object_ dtypes; unicode arrays, and especially chararrays, are much slower.
- "Vectorized" in the numpy string documentation means a concise, whole-array call, not necessarily a faster one.
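A sketch of that appending point (sizes and names here are arbitrary): list.append grows the list in place, while np.append copies the whole array on every call.

import numpy as np

# Growing a list: append mutates in place, amortized O(1) per element.
items = []
for i in range(10000):
    items.append(str(i))

# "Growing" a numpy array: np.append allocates and copies a new array each
# time, so building it up element by element is quadratic overall.
arr = np.array([], dtype=object)
for i in range(10000):
    arr = np.append(arr, str(i))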