I have a Spark SQL DataFrame with date data, and what I'm trying to get is all the rows preceding the current row in a given date range. So for example I want to have all the rows from 7 days back preceding the given row. I figured out I need to use a Window function like:
Window \
    .partitionBy('id') \
    .orderBy('start')
and here comes the problem. I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
If anyone could help me with this one I'd be very grateful. Thanks in advance!
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects with the SQL API, but the DataFrame API support is still a work in progress.
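For example, with the DataFrame registered as a temporary view, an interval literal can be used directly in the frame definition (some_value below is a placeholder for whatever column you want to aggregate):

df.createOrReplaceTempView("df")

spark.sql("""
    SELECT *, mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    ) AS mean FROM df
""").show()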
Spark < 2.3
As far as I know it is not possible directly, either in Spark or in Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds, assuming the start column contains a date type.

A small helper and window definition along these lines:
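from pyspark.sql import Window
from pyspark.sql.functions import col, mean

# RANGE frames need numeric boundaries, so express the window in seconds.
days = lambda i: i * 86400

w = (Window
    .partitionBy(col("id"))
    # date -> timestamp -> long gives seconds since the epoch
    .orderBy(col("start").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0))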
And finally the query, where some_value again stands in for the column you want to aggregate:
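# Rolling 7-day mean of some_value (a placeholder column) per id
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()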
Far from pretty, but it works.
* Hive Language Manual, Types