I am familiar with the concept of "vectorization", and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.
However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using `for` loops and list comprehensions. Having read the documentation, and with a decent understanding of the API, I am given to believe that loops are "bad", and that one should "never" iterate over arrays, series, or DataFrames. So, how come I see users suggesting loopy solutions every now and then?
So, to summarise... my question is:
Are `for` loops really "bad"? If not, in what situation(s) would they be better than using a more conventional "vectorized" approach? [1]
[1] - While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when `for` loops are usually better than the conventional vectorized approach. This post aims to capture this for posterity.
TLDR; No, `for` loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:
1. When your data is small
2. When dealing with `object`/mixed dtypes
3. When using the `str`/regex accessor functions
Let's examine these situations individually.
Iteration v/s Vectorization on Small Data
Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.
When a pandas function is called, the following things (among others) must internally be handled by the function to ensure it works correctly: index/axis alignment, handling of mixed datatypes, and handling of missing data.
Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, `Series.add`), while it is more pronounced for string functions (for example, `Series.str.replace`).
`for` loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through `for` loops) are even faster, as they are optimized iterative mechanisms for list creation.
List comprehensions follow the pattern `[f(x) for x in seq]`.
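A minimal sketch of the list comprehension pattern, in both its single-sequence and two-sequence forms. Plain Python lists stand in for DataFrame columns here so that the snippet runs without pandas (iterating over a Series yields the same elements); the names `seq1`, `seq2`, and `f`, as well as the sample values, are assumptions made for illustration.

```python
# Hypothetical stand-ins for two DataFrame columns.
seq1 = [1, 2, 3, 4]
seq2 = [1, 5, 3, 0]


def f(x):
    # Placeholder element-wise function (an assumption for illustration).
    return x * 2


# Single-sequence form: apply f to every element.
doubled = [f(x) for x in seq1]

# Two-sequence form: walk both columns in lockstep with zip.
# The element-wise operation here is a "not equals" comparison,
# which produces a boolean mask with one entry per row.
mask = [x != y for x, y in zip(seq1, seq2)]

print(doubled)  # [2, 4, 6, 8]
print(mask)     # [False, True, False, True]
```

With a real DataFrame, a mask built this way can be passed to boolean indexing, for example `df[mask]`.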
In this pattern, `seq` is a pandas Series or DataFrame column. Or, when operating over multiple columns, the pattern is `[f(x, y) for x, y in zip(seq1, seq2)]`, where `seq1` and `seq2` are columns.
Numeric Comparison
Consider a simple boolean indexing operation that keeps the rows in which two columns differ. The list comprehension method has been timed against `Series.ne` (`!=`) and `query`.
For simplicity, I have used the `perfplot` package to run all the timeit tests in this post. The timings for the operations above are below:
The list comprehension outperforms `query` for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.
Value Counts
Take another example, this time with another vanilla Python construct that is faster than a `for` loop: `collections.Counter`. A common requirement is to compute the value counts and return the result as a dictionary. This is done with `value_counts`, `np.unique`, and `Counter`.
The results are more pronounced: `Counter` wins out over both vectorized methods for a larger range of small N (~3500).
Of course, the takeaway here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.
Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
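Going back to the pure Python option from the value counts comparison earlier in this section, here is a rough sketch of the `Counter` approach. The list `values` is a hypothetical stand-in for a column (a Series iterates over the same elements), and the sample data is made up for illustration.

```python
from collections import Counter

# Hypothetical column values.
values = ["a", "b", "a", "c", "b", "a"]

# Counter tallies hashable items in a single pass. It is a dict subclass,
# so dict() simply converts the result to a plain dictionary.
counts = dict(Counter(values))

print(counts)  # {'a': 3, 'b': 2, 'c': 1}
```

The pandas counterpart would be along the lines of `ser.value_counts().to_dict()`, which is the kind of call this approach is being compared against.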
Operations with Mixed/`object` dtypes
String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.
So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.
When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.
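To give an idea of what such operations look like, here is a sketch of list comprehensions working on a column of dicts and a column of lists. The names `dict_col`, `list_col`, and `get_0th`, the key `"key"`, and the sample data are all assumptions for illustration; plain lists stand in for object-dtype columns, and `math.nan` is used as the missing-value marker to keep the snippet free of third-party imports.

```python
import math

# Hypothetical object-dtype "columns".
dict_col = [{"key": 1}, {"key": 2}, {"other": 3}]
list_col = [[10, 20], [], None, [30]]

# Dictionary value by key: dict.get returns None when the key is absent.
values = [d.get("key") for d in dict_col]


def get_0th(lst):
    """Return the first element, or NaN if the value is empty or not indexable."""
    try:
        return lst[0]
    except (IndexError, TypeError):
        return math.nan


# Positional indexing with exception handling, one element at a time.
firsts = [get_0th(x) for x in list_col]

print(values)  # [1, 2, None]
print(firsts)  # [10, nan, nan, 30]
```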
Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: `map` and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".
Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): `map`, the `str.get` accessor method, and the list comprehension:
List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.
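As a sketch of the two pure Python ways to flatten a column of lists, assuming a hypothetical list of lists named `nested` in place of the column:

```python
from itertools import chain

# Hypothetical column of lists.
nested = [[1, 2], [3], [], [4, 5, 6]]

# Option 1: chain.from_iterable lazily concatenates the inner lists.
flat_chain = list(chain.from_iterable(nested))

# Option 2: nested list comprehension (outer loop first, then inner loop).
flat_comp = [item for sublist in nested for item in sublist]

print(flat_chain)  # [1, 2, 3, 4, 5, 6]
print(flat_comp)   # [1, 2, 3, 4, 5, 6]
```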
Both `itertools.chain.from_iterable` and the nested list comprehension are pure python constructs, and scale much better than the `stack` solution.
These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed `apply` on these solutions, because it would skew the graph (yes, it's that slow).
Regex Operations, and `.str` Accessor Methods
Pandas can apply regex operations such as `str.contains`, `str.extract`, and `str.extractall`, as well as other "vectorized" string operations (such as `str.split`, `str.find`, `str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.
It is usually much faster to pre-compile a regex pattern with `re.compile` and iterate over your data (also see "Is it worth using Python's re.compile?"). The list comp equivalent to `str.contains` builds a boolean mask with the compiled pattern `p`, along the lines of `[bool(p.search(x)) for x in seq]` or, written with a conditional expression, `[True if p.search(x) else False for x in seq]`.
If you need to handle NaNs, you can add a null check to the comprehension so that missing values map to False.
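A minimal sketch of this `str.contains` equivalent, with and without NaN handling. The pattern, the sample strings, and the names `clean`, `seq`, and `mask` are assumptions for illustration. The null check here uses `isinstance(x, str)` rather than a pandas helper, purely so that the snippet runs with the standard library alone.

```python
import re

# Compile once, reuse for every element. Hypothetical pattern: any run of digits.
p = re.compile(r"\d+")

# Clean data: no missing values, so no guard is needed.
clean = ["apple 12", "banana", "cherry 7"]
mask_clean = [bool(p.search(x)) for x in clean]

# Data with a missing value. NaN is a float, and p.search would raise
# TypeError on it, so non-strings are mapped to False instead.
seq = ["apple 12", "banana", float("nan"), "cherry 7"]
mask = [bool(p.search(x)) if isinstance(x, str) else False for x in seq]

print(mask_clean)  # [True, False, True]
print(mask)        # [True, False, False, True]
```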
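The list comp equivalent to `str.extract` (without groups) calls `p.search(x).group(0)` on each value. If you need to handle no-matches and NaNs, you can use a custom function, `matcher`, that returns the matched text when there is a match and NaN otherwise (still faster!).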
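Here is a rough sketch of both variants. The pattern and sample data are made up, and `math.nan` stands in for the NaN value so that no third-party import is needed; treat this `matcher` as one possible way to write such a function rather than a definitive version.

```python
import math
import re

# Hypothetical pattern: the first run of digits.
p = re.compile(r"\d+")

# Simple variant: only safe when every value is a string that matches,
# because p.search returns None on a miss and None has no .group.
simple = [p.search(x).group(0) for x in ["a1", "b22"]]


def matcher(x):
    """Return the matched text, or NaN when x is missing or has no match."""
    m = p.search(str(x))  # str() lets NaN and other non-strings through safely
    if m:
        return m.group(0)
    return math.nan


seq = ["order 66", "no digits", float("nan")]
extracted = [matcher(x) for x in seq]

print(simple)     # ['1', '22']
print(extracted)  # ['66', nan, nan]
```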
The `matcher` function is very extensible. It can be fitted to return a list for each capture group, as needed. Just call the `group` or `groups` method of the match object.
For `str.extractall`, change `p.search` to `p.findall`.
String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an upper case letter.
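One way to express this rule is a lookbehind for an upper case letter followed by exactly four digits. The sketch below assumes that reading of the requirement; the sample strings and the helper name `first_match` are hypothetical.

```python
import math
import re

# (?<=[A-Z]) asserts that an upper case letter comes immediately before;
# \d{4} then consumes four digits. The letter is not part of the match.
p = re.compile(r"(?<=[A-Z])\d{4}")


def first_match(x):
    """Return the first 4-digit match in x, or NaN if there is none."""
    m = p.search(x)
    return m.group(0) if m else math.nan


seq = ["foo xyz", "test A1234", "D3345 xtz B9876"]

# First occurrence per string.
first = [first_match(x) for x in seq]

# All occurrences per string.
every = [p.findall(x) for x in seq]

print(first)  # [nan, '1234', '3345']
print(every)  # [[], ['1234'], ['3345', '9876']]
```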
More Examples
Full disclosure - I am the author (in part or whole) of these posts listed below.
Fast punctuation removal with pandas
How to check if first word of a DataFrame string column is present in a List in Python?
Python Pandas - How to extract the left a series of characters in a string
Replace all but the last occurrence of a character in a dataframe
Conclusion
As shown by the examples above, iteration shines when working with small DataFrames (few rows), mixed datatypes, and regular expressions.
The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payoff is worth the effort.
The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.
Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:
Create new column with incremental values in a faster and efficient way - Answer by Divakar
Fast punctuation removal with pandas - Answer by Paul Panzer
As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.
Appendix: Code Snippets