In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which looks up the next and previous holiday for each row; from these I calculate the distance to the nearest holiday and store it as a new column.
searchsorted (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html) was a great solution for pandas, as it gives me the index of the next holiday without high algorithmic complexity (see "Parallelize pandas apply"); e.g. this approach was a lot quicker than parallel looping.
How can I achieve this in Spark or Hive?
This can be done using aggregations, but that approach would have higher complexity than the pandas method. You can achieve similar performance using UDFs, though. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
And a dataset of the dates of 2016 in a DataFrame:
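As a rough sketch of what such inputs might look like: the holiday dates below are hypothetical placeholders, and the names `holidays`, `dates_2016`, `to_spark_dataframes` and the column name `holiday` are assumptions made for illustration. The `spark` argument is assumed to be an existing SparkSession.

```python
from datetime import date, timedelta

# Hypothetical holiday list, used only for illustration.
# Kept sorted so that a binary search can be run against it later.
holidays = sorted([date(2016, 1, 1), date(2016, 7, 4), date(2016, 12, 25)])

# Every calendar day of 2016 (a leap year, hence 366 days).
start = date(2016, 1, 1)
dates_2016 = [start + timedelta(days=i) for i in range(366)]


def to_spark_dataframes(spark):
    """Turn the two lists into single-column Spark DataFrames.

    `spark` is assumed to be an existing SparkSession; pyspark is only
    needed once this function is actually called.
    """
    holidays_df = spark.createDataFrame([(d,) for d in holidays], ["holiday"])
    dates_df = spark.createDataFrame([(d,) for d in dates_2016], ["dateColumn"])
    return holidays_df, dates_df
```

If the holiday list is small, it can be kept as a plain sorted Python list on the driver and shipped to the executors inside the UDF's closure.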
The UDF could use pandas searchsorted, but that would require installing pandas on the executors. Instead, you can use plain Python like this:
And it can be used with withColumn:
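A minimal sketch of such a plain-Python UDF and its use with withColumn, assuming the holidays fit in a sorted list that is captured in the function's closure. It uses the standard library's `bisect_left`, which performs the same kind of binary search as searchsorted with its default left side. The holiday dates and the names `HOLIDAYS`, `next_holiday`, `previous_holiday` and `add_holiday_columns` are hypothetical.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical, sorted holiday list. In practice it could be collected
# from the holidays dataset on the driver and sorted once.
HOLIDAYS = [date(2016, 1, 1), date(2016, 7, 4), date(2016, 12, 25)]


def next_holiday(d, holidays=HOLIDAYS):
    """Return the first holiday on or after d, or None if there is none."""
    if d is None:  # null values in the Spark column arrive as None
        return None
    i = bisect_left(holidays, d)
    return holidays[i] if i < len(holidays) else None


def previous_holiday(d, holidays=HOLIDAYS):
    """Return the last holiday strictly before d, or None if there is none."""
    if d is None:
        return None
    i = bisect_left(holidays, d)
    return holidays[i - 1] if i > 0 else None


def add_holiday_columns(df, date_col="dateColumn"):
    """Add nextHolidays / previousHolidays columns to a Spark DataFrame.

    pyspark is imported lazily, so the lookup functions above can be
    tested without a Spark installation.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DateType

    next_udf = udf(next_holiday, DateType())
    prev_udf = udf(previous_holiday, DateType())
    return (df
            .withColumn("nextHolidays", next_udf(col(date_col)))
            .withColumn("previousHolidays", prev_udf(col(date_col))))
```

Unlike the pandas snippet in the question, where an index of 0 makes `indices - 1` wrap around to the last holiday and an index past the end raises an error, this sketch returns None (a null in Spark) at both ends.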