I have some data that looks like this:
time
08:28:24
22:20:54
12:59:38
21:46:07
I want to select the times that fall between 16:00:00 and 23:59:59, as a closed range.
What should I do? (The 'time' column type is string.)
Thank you!
Your condition can be simplified to checking if the hour part of your time column is between 16 and 23. You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, and make the comparison using pyspark.sql.Column.between() (which is inclusive of the bounds).
from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#| time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+
Note that even though split returns a string, there is an implicit conversion to int to do the between comparison.
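If you'd rather not rely on that implicit conversion, you can cast the token explicitly before comparing; this sketch against the same df is equivalent:

from pyspark.sql.functions import split
# cast the hour token to int explicitly, then apply the inclusive between
df.where(split("time", ":")[0].cast("int").between(16, 23)).show()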
Of course, this could be extended if you had more complicated filtering criteria that also involved looking at minutes or seconds:
df.select(
"*",
split("time", ":")[0].cast("int").alias("hour"),
split("time", ":")[1].cast("int").alias("minute"),
split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#| time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24| 8| 28| 24|
#|22:20:54| 22| 20| 54|
#|12:59:38| 12| 59| 38|
#|21:46:07| 21| 46| 7|
#+--------+----+------+------+
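For instance, building on those extracted values, a filter keeping rows at or after 16:30:00 might look like the following (the 16:30:00 cutoff is purely illustrative, not from the question):

from pyspark.sql.functions import split
hour = split("time", ":")[0].cast("int")
minute = split("time", ":")[1].cast("int")
# keep rows where the time is 16:30:00 or later
df.where((hour > 16) | ((hour == 16) & (minute >= 30))).show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+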
You can use the column function between once you convert your column. See the example with comments below:
import datetime
from pyspark.sql import functions as F
from pyspark.sql import types as T
# PySpark does not provide a type that holds time-only values,
# therefore we have to work with the datetime format.
# When your string only represents the time, PySpark will automatically add the current date.
# The filter expression we will use later for your range also needs the current date.
now = datetime.datetime.now().strftime("%Y-%m-%d")
l1 = [('08:28:24',)
,('22:20:54',)
,('12:59:38',)
,('21:46:07',)
,('16:00:00',)]
df = spark.createDataFrame(l1,['time'])
# Convert the string column to a timestamp (the current date is added automatically)
df = df.withColumn('time', df.time.cast(T.TimestampType()))
# Apply the filter, prefixing both bounds with the current date
df = df.filter(F.col("time").between(now + ' 16:00:00', now + ' 23:59:59'))
df.show()
Output:
+-------------------+
| time|
+-------------------+
|2019-03-12 22:20:54|
|2019-03-12 21:46:07|
|2019-03-12 16:00:00|
+-------------------+
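If you need the time-only strings back after filtering, one option (not part of the original answer) is to reformat the timestamp with date_format:

from pyspark.sql import functions as F
# format the timestamp back to an HH:mm:ss string
df = df.withColumn('time', F.date_format('time', 'HH:mm:ss'))
df.show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#|16:00:00|
#+--------+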