I have data like this:
df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))
I want to calculate the date difference between the low column and 2017-05-02, and replace low with the difference. I've tried related solutions on Stack Overflow, but none of them works.
You need to cast the low column to class date, and then you can use datediff() in combination with lit(). Using Spark 2.2:
from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date("low", "yyyy/MM/dd"))).show()
+----------+----+------+-----+
| low|high|normal| test|
+----------+----+------+-----+
|1986/10/15| z| null|11157|
|1986/10/15| z| null|11157|
|1986/10/15| c| null|11157|
|1986/10/15|null| null|11157|
|1986/10/16|null| 4.0|11156|
+----------+----+------+-----+
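Note that this adds a new test column rather than replacing low. Since the goal is to replace low with the difference, a minimal variant (same Spark 2.2 approach, just overwriting low in place) would be:

from pyspark.sql.functions import datediff, to_date, lit

# overwrite `low` with the day difference instead of adding a new column
df = df.withColumn("low", datediff(to_date(lit("2017-05-02")),
                                   to_date("low", "yyyy/MM/dd")))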
Using Spark < 2.2, we need to convert the low column to class timestamp first:
from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp("low", "yyyy/MM/dd").cast("timestamp")))).show()
Alternatively, here is how to find the number of days between two subsequent user actions using PySpark:
import pyspark.sql.functions as funcs
from pyspark.sql.window import Window

# lag() must be referenced through the functions module (or imported explicitly)
window = Window.partitionBy('user_id').orderBy('action_date')
df = df.withColumn('days_passed',
                   funcs.datediff(df.action_date,
                                  funcs.lag(df.action_date, 1).over(window)))
+-------+-----------+-----------+
|user_id|action_date|days_passed|
+-------+-----------+-----------+
|    623| 2015-10-21|       null|
|    623| 2015-11-19|         29|
|    623| 2016-01-13|         59|
|    623| 2016-01-21|          8|
|    623| 2016-03-24|         63|
+-------+-----------+-----------+
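The snippet above assumes a df with user_id and action_date columns; a hypothetical sample with the same dates (reconstructed from the output above) can be built like this so it runs end to end:

df = sqlContext.createDataFrame(
    [('623', '2015-10-21'),
     ('623', '2015-11-19'),
     ('623', '2016-01-13'),
     ('623', '2016-01-21'),
     ('623', '2016-03-24')],
    ('user_id', 'action_date'))

datediff() implicitly casts yyyy-MM-dd strings to dates, so no explicit to_date() is needed for this sample.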