I have a `pandas.Series` containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a `Series` object:
```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000000))
```
On StackOverflow and other websites, I've seen most people argue that the best way to do this is:
```python
%%timeit
x = x.astype(str)
```
This takes about 2 seconds.
When I use `x = x.apply(str)`, it only takes 0.2 seconds.
Why is `x.astype(str)` so slow? Should the recommended way be `x.apply(str)`?
I'm mainly interested in Python 3's behavior here.
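(`%%timeit` is IPython-only; outside a notebook, the same comparison can be sketched with the stdlib `timeit` module. The sizes and repeat counts here are illustrative, and the exact numbers will differ by machine and library version:)

```python
import timeit

import numpy as np
import pandas as pd

# Smaller Series than the original so the comparison runs quickly.
x = pd.Series(np.random.randint(0, 100, 100_000))

# Time both conversion strategies over a few runs each.
t_astype = timeit.timeit(lambda: x.astype(str), number=5)
t_apply = timeit.timeit(lambda: x.apply(str), number=5)

print(f"astype(str): {t_astype:.3f}s for 5 runs")
print(f"apply(str):  {t_apply:.3f}s for 5 runs")
```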
Let's begin with a bit of general advice: if you're interested in finding the bottlenecks of Python code, you can use a profiler to find the functions/parts that eat up most of the time. In this case I'll use a line profiler, because you can actually see the implementation and the time spent on each line.
However, these tools don't work with C or Cython by default. Given that CPython (the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython, there will be a limit to how far I'll get with profiling.
Actually, one could probably extend profiling to the Cython code, and probably also the C code, by recompiling with debug symbols and tracing; however, it's not an easy task to compile these libraries, so I won't do that (but if someone likes to try, the Cython documentation includes a page about profiling Cython code).
But let's see how far I can get:
Line-Profiling Python code
I'm going to use line-profiler and a Jupyter Notebook here:
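(The line-profiler session itself isn't reproduced here. As a rough stdlib stand-in, `cProfile` shows where the time goes at function level rather than line level; this is only an approximation of the line-profiler workflow described above:)

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 100_000))

# Profile the astype(str) call and capture the stats as text.
profiler = cProfile.Profile()
profiler.enable()
x.astype(str)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
print(stream.getvalue())
```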
Profiling `x.astype`

So that's simply a decorator, and 100% of the time is spent in the decorated function. So let's profile the decorated function:
Source
Again one line is the bottleneck, so let's check the `_data.astype` method:

Okay, another delegate. Let's see what `_data.apply` does:

Source
And again ... one function call is taking all the time, this time it's `x._data.blocks[0].astype`:

... which is another delegate ...
Source
... okay, still not there. Let's check out `astype_nansafe`:

Source
Again it's one line that takes 100%, so I'll go one function further:
Okay, we found a built-in function, which means it's a C function. In this case it's a Cython function. But it means we cannot dig deeper with the line profiler, so I'll stop here for now.

Profiling `x.apply`
Source
Again it's one function that takes most of the time: `lib.map_infer` ... okay, that's another Cython function.

This time there's another (although less significant) contributor with ~3%: `values = self.asobject`. But I'll ignore this for now, because we're interested in the major contributors.

Going into C/Cython
The functions called by `astype`

This is the `astype_unicode` function:

Source
This function uses this helper:
Source
Which itself uses this C function:
Source
Functions called by `apply`

This is the implementation of the `map_infer` function:

Source
With this helper:
Source
Which uses this C function:
Source
Some thoughts on the Cython code
There are some differences between the Cython functions that are eventually called. The one taken by `astype` uses `unicode`, while the `apply` path uses the function passed in. Let's see if that makes a difference (again, IPython/Jupyter makes it very easy to compile Cython code yourself):

Timing:
Okay, there is a difference, but it's wrong: it would actually indicate that `apply` is slightly slower.

But remember the `asobject` call that I mentioned earlier in the `apply` function? Could that be the reason? Let's see:

Now it looks better. The conversion to an object array made the function called by `apply` much faster. There is a simple reason for this:
`str` is a Python function, and these are generally much faster if you already have Python objects and NumPy (or pandas) doesn't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array has dtype `object`).

However, that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the way the arrays are iterated over and the elements are set in the result. Very likely the:
part of the `map_infer` function is faster than:

which is called by the `astype(str)` path. The comments of the first function seem to indicate that the writer of `map_infer` actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?"), while the other one maybe was written without special care for performance. But that's just a guess.

Also, on my computer I'm actually quite close to the performance of `x.astype(str)` and `x.apply(str)` already:

Note that I also checked some other variants that return a different result:
Interestingly, the Python loop with `list` and `map` seems to be the fastest on my computer.

I actually made a small benchmark including a plot:
Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.
The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:
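(The benchmark script and plot aren't reproduced here; a reduced sketch of the variants it compared, for anyone who wants to re-run it, might look like the following. Sizes and repeat counts are assumptions, and the ranking may differ on your machine:)

```python
import timeit

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 10_000))

# The variants compared above; note the list/map versions return a
# plain list, not a Series, so the results are not interchangeable.
variants = {
    "x.astype(str)": lambda: x.astype(str),
    "x.apply(str)": lambda: x.apply(str),
    "list(map(str, x))": lambda: list(map(str, x)),
    "[str(i) for i in x]": lambda: [str(i) for i in x],
}

for name, fn in variants.items():
    t = timeit.timeit(fn, number=10)
    print(f"{name:22s} {t:.4f}s")
```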
Performance

It's worth looking at actual performance before beginning any investigation, since, contrary to popular opinion, `list(map(str, x))` appears to be slower than `x.apply(str)`.

Points worth noting:

- … lambda function is used].

Why is x.map / x.apply fast?
This appears to be because it uses fast compiled Cython code:
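(The pandas source showing the Cython call isn't reproduced here, but the fact that `map` and `apply` with a plain function end up in the same element-wise machinery is easy to check; a quick sketch:)

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1_000))

mapped = x.map(str)
applied = x.apply(str)

# Both routes produce identical results: an object-dtype Series
# holding Python str values.
assert mapped.equals(applied)
print(mapped.dtype)  # object
```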
Why is x.astype(str) slow?
Pandas applies `str` to each item in the series, not using the above Cython.

Hence performance is comparable to `[str(i) for i in x]` / `list(map(str, x))`.

Why is x.values.astype(str) so fast?
Numpy does not apply a function to each element of the array. One description of this I found:

> There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
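That vectorised path can be seen directly: on the numpy side, `astype(str)` converts the whole array to a fixed-width unicode dtype in one pass rather than calling `str` per element. A quick check (the exact `<U…>` width depends on the input integer dtype):

```python
import numpy as np

arr = np.array([1, 22, 333])

# astype(str) produces a fixed-width unicode array ('<U...'),
# sized to hold the longest possible string, in a single vectorised pass.
converted = arr.astype(str)
print(converted)        # ['1' '22' '333']
print(converted.dtype.kind)  # 'U' (fixed-width unicode, not Python str objects)
```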