How to compute the diff for one column in a Spark DataFrame

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+

For a Spark DataFrame, I want to compute the difference between consecutive datetime values, just like numpy.diff(array).

Generally speaking, there is no efficient way to achieve this using Spark DataFrames, not to mention that things like ordering become quite tricky in a distributed setup. Theoretically, you can use the lag function as follows:

from pyspark.sql.functions import lag, col, unix_timestamp
from pyspark.sql.window import Window

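# Parse the string column into a timestamp. unix_timestamp returns seconds,
# and a numeric value cast to timestamp is read as seconds, so no scaling is needed.
dev_time = unix_timestamp(col("dev_time")).cast("timestamp")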

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

w = Window.orderBy("dev_time")
lag_dev_time = lag("dev_time").over(w).cast("integer")

diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))

## diff.show()
## +----+
## |diff|
## +----+
## |null|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |  10|
## ...

However, this is extremely inefficient, because window functions move all data to a single partition when no PARTITION BY clause is provided. In practice, it makes more sense to use the sliding method on an RDD (Scala) or to implement your own sliding window (Python). See:
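As a rough illustration of the "own sliding window" idea, here is a minimal sketch in plain Python that pairs each element with its successor and subtracts. It only shows the local window logic on an already sorted sequence; distributing it over an RDD (for example per partition, with handling of partition boundaries) is not shown. The helper names `sliding_pairs` and `time_diffs` are hypothetical and not part of any Spark API.

```python
from datetime import datetime
from itertools import tee

# Timestamp layout used in the question's Dev_time column.
FMT = "%Y-%m-%d %H:%M:%S"


def sliding_pairs(iterable):
    """Yield consecutive pairs (x0, x1), (x1, x2), ... from an iterable."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one element
    return zip(first, second)


def time_diffs(timestamps):
    """Return the differences in seconds between consecutive timestamp strings.

    Assumes the input is already sorted; like numpy.diff, the result has
    one element fewer than the input (no leading null).
    """
    parsed = (datetime.strptime(ts, FMT) for ts in timestamps)
    return [int((b - a).total_seconds()) for a, b in sliding_pairs(parsed)]


sample = [
    "2015-09-18 05:00:26", "2015-09-18 05:00:27",
    "2015-09-18 05:00:37", "2015-09-18 05:00:37",
    "2015-09-18 05:00:38",
]
diffs = time_diffs(sample)
print(diffs)  # [1, 10, 0, 1]
```

Unlike the lag-based version, this produces no null for the first row, which matches the behaviour of numpy.diff.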