I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date.
Suppose that my_table contains:
start | stop
-------------------------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
In PostgreSQL it's very easy to do that:
SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table
and it will generate this table:
dt
------------
2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
2012-03-20
2012-03-21
2012-03-22
2012-03-23
but how can I do that using plain SparkSQL? Will it be necessary to use UDFs or some DataFrame methods?
EDIT
This creates a dataframe with one row containing an array of consecutive dates:
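A minimal sketch of that step, with the dates from the question hard-coded for illustration (it relies on the sequence function introduced below, so it needs Spark 2.4+, and assumes an active SparkSession named spark):

df = spark.sql(
    "SELECT sequence(to_date('2000-01-01'), to_date('2000-01-05'), interval 1 day) AS dt"
)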
You can use the explode function to "pivot" this array into rows:
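For example, continuing the snippet above:

from pyspark.sql.functions import explode

# Each array element becomes its own row
df = df.withColumn("dt", explode("dt"))
df.show()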
(End of edit)
Spark v2.4 supports the sequence function: https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
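With it, the question's PostgreSQL query translates almost directly to plain SparkSQL. A sketch, assuming Spark 2.4+ and that the DataFrame holding the start/stop pairs is registered as a temp view named my_table:

df.createOrReplaceTempView("my_table")  # df holds the start/stop date columns

result = spark.sql(
    "SELECT explode(sequence(start, stop, interval 1 day)) AS dt FROM my_table"
)
result.show()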
@Rakesh's answer is correct, but I would like to share a less verbose solution:
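One approach in that spirit registers a Python UDF that mirrors PostgreSQL's generate_series and calls it from SQL; the function name generate_date_series and the view name mydf below are illustrative, and spark is assumed to be an active SparkSession:

import datetime
from pyspark.sql.types import ArrayType, DateType

# Mirror PostgreSQL's generate_series for dates, inclusive of both endpoints
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x)
            for x in range((stop - start).days + 1)]

# Register the UDF so it can be called from SQL
spark.udf.register("generate_date_series", generate_date_series,
                   ArrayType(DateType()))

# mydf is assumed to have `start` and `stop` columns of DateType
mydf.createOrReplaceTempView("mydf")
spark.sql("SELECT explode(generate_date_series(start, stop)) AS dt FROM mydf").show()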
Suppose you have a dataframe df from Spark SQL; try this:
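The snippet itself is not shown here; a sketch of one pre-2.4 way to expand each start/stop pair of df, using datediff together with a posexplode trick (column names start and stop are assumed from the question):

from pyspark.sql.functions import col, expr, posexplode

# repeat(',', n) builds a string of n commas; splitting it on ',' yields
# an array of n + 1 empty strings, and posexplode supplies the index i
result = (
    df.withColumn("n", expr("datediff(stop, start)"))
      .select(col("start"),
              posexplode(expr("split(repeat(',', n), ',')")).alias("i", "dummy"))
      .selectExpr("date_add(start, i) AS dt")
)
result.show()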
The existing answers will work, but are very inefficient. Instead it is better to use range and then cast the data. In Python:
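A sketch of that idea; the helper name generate_series and its seconds-based interval parameter are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def generate_series(start, stop, interval):
    """Return a DataFrame of timestamps from start to stop (inclusive),
    stepped by `interval` seconds. start/stop are date strings."""
    spark = SparkSession.builder.getOrCreate()
    # Convert the string bounds to epoch seconds
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")]
    ).first()
    # range() generates the longs in parallel; cast them back to timestamps
    # (stop + 1 keeps the upper bound, since range's end is exclusive)
    return spark.range(start, stop + 1, interval).select(
        col("id").cast("timestamp").alias("dt")
    )

Example usage:

# One row per day, inclusive of both endpoints
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show()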