New to pandas, I already want to parallelize a row-wise apply operation. So far I have found Parallelize apply after pandas groupby. However, that only seems to work for grouped data frames.
My use case is different: I have a list of holidays, and for the current row/date I want to find the number of days to the nearest holiday before and after this day.
This is the function I call via apply:
import numpy as np

def get_nearest_holiday(x, pivot):
    # x: the collection of holiday dates; pivot: the current row's date
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')
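For context, a minimal sketch of how this gets called via apply, using the function above; the dataframe df, its date column, and the holidays list are illustrative assumptions, not the original data:

import numpy as np
import pandas as pd

# Illustrative setup (assumed): a dataframe with a datetime column and a few holidays
df = pd.DataFrame({'date': pd.date_range('2016-01-01', periods=5, freq='7D')})
holidays = list(pd.to_datetime(['2016-01-01', '2016-07-04', '2016-12-25']))

# The slow, row-wise apply this question wants to speed up
df['days_to_holiday'] = df['date'].apply(lambda d: get_nearest_holiday(holidays, d))
print(df)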
How can I speed it up?
Edit
I experimented a bit with Python's pools, but it was neither nice code nor did I get my computed results back.
I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...
Let's just start with some dates...
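For example (the particular dates here are just an assumption for illustration):

import numpy as np
import pandas as pd

# A random sample of dates within 2016 to look up (illustrative assumption)
np.random.seed(0)
dates = pd.to_datetime('2016-01-01') + pd.to_timedelta(
    np.random.randint(0, 365, size=10), unit='D')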
We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex:
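A sketch, assuming the built-in US federal calendar and a date range around 2016 (the answer may have used a different calendar or range):

from pandas.tseries.holiday import USFederalHolidayCalendar

# Holidays as a sorted DatetimeIndex (assumption: the US federal calendar, 2015-2017)
holidays = USFederalHolidayCalendar().holidays('2015-12-01', '2017-01-31')
print(holidays)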
This gives us the holidays as a sorted DatetimeIndex of datetime64 dates to search against.
Now we find the indices of the nearest holiday for the original dates using searchsorted, then take the difference between the two:
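A sketch of both steps, repeating the assumed setup from above so it runs on its own:

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Same assumed setup as the snippets above
np.random.seed(0)
dates = pd.to_datetime('2016-01-01') + pd.to_timedelta(
    np.random.randint(0, 365, size=10), unit='D')
holidays = USFederalHolidayCalendar().holidays('2015-12-01', '2017-01-31')

# Index of the first holiday on or after each date
indices = holidays.searchsorted(dates)

# Days from each date to that holiday, as floats
# (the previous holiday would use indices - 1 instead)
days_to_next = (holidays[indices] - dates) / np.timedelta64(1, 'D')
print(days_to_next)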
You'll need to be careful about the indices so you don't wrap around, and for the previous date, do the calculation with indices - 1, but it should act as (I hope) a relatively good base.

For the parallel approach, this is the answer based on Parallelize apply after pandas groupby:
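The original snippet isn't reproduced here; a rough sketch of that style of approach (the chunking, function names, and example data are all assumptions) could look like this:

import multiprocessing

import numpy as np
import pandas as pd

def get_nearest_holiday(x, pivot):
    # The row-wise function from the question
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    return abs(nearestHoliday - pivot) / np.timedelta64(1, 'D')

def process_chunk(args):
    # Apply the row-wise function to one chunk of the dataframe
    chunk, holidays = args
    return chunk['date'].apply(lambda d: get_nearest_holiday(holidays, d))

if __name__ == '__main__':
    # Illustrative data (assumed)
    df = pd.DataFrame({'date': pd.date_range('2016-01-01', '2016-12-31')})
    holidays = list(pd.to_datetime(['2016-01-01', '2016-07-04', '2016-12-25']))

    # Split into roughly one chunk per core and process the chunks in parallel
    n_chunks = multiprocessing.cpu_count()
    chunk_size = int(np.ceil(len(df) / n_chunks))
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

    with multiprocessing.Pool() as pool:
        results = pool.map(process_chunk, [(chunk, holidays) for chunk in chunks])

    df['days_to_holiday'] = pd.concat(results)
    print(df.head())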
However, I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays) work.