I have some data that looks like this:
time
08:28:24
22:20:54
12:59:38
21:46:07
I want to select the times that fall between 16:00:00 and 23:59:59, as a closed range.
What should I do? (The 'time' column type is string.)
Thank you!
Your condition can be simplified to checking if the hour part of your time column is between 16 and 23. You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, and make the comparison using pyspark.sql.Column.between() (which is inclusive of the bounds).
from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#| time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+
Note that even though split returns a string, there is an implicit conversion to int to do the between comparison.
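If you'd rather not rely on that implicit conversion, you can cast the token explicitly before comparing; this sketch against the same df is equivalent:

from pyspark.sql.functions import split
# cast the hour token to int explicitly, then apply the inclusive between
df.where(split("time", ":")[0].cast("int").between(16, 23)).show()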
Of course, this could be extended if you had more complicated filtering criteria that also involved looking at minutes or seconds:
df.select(
"*",
split("time", ":")[0].cast("int").alias("hour"),
split("time", ":")[1].cast("int").alias("minute"),
split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#| time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24| 8| 28| 24|
#|22:20:54| 22| 20| 54|
#|12:59:38| 12| 59| 38|
#|21:46:07| 21| 46| 7|
#+--------+----+------+------+
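For instance, building on those extracted values, a filter keeping rows at or after 16:30:00 might look like the following (the 16:30:00 cutoff is purely illustrative, not from the question):

from pyspark.sql.functions import split
hour = split("time", ":")[0].cast("int")
minute = split("time", ":")[1].cast("int")
# keep rows where the time is 16:30:00 or later
df.where((hour > 16) | ((hour == 16) & (minute >= 30))).show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+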
You can use the column function between once you convert your column. See the example with comments below:
import datetime
from pyspark.sql import functions as F
from pyspark.sql import types as T
# PySpark does not provide a type that holds time-only values,
# therefore we have to work with the datetime format.
# When your string only represents the time, PySpark will automatically add the current date.
# The filter expression we will use later for your range also needs the current date.
now = datetime.datetime.now().strftime("%Y-%m-%d")
l1 = [('08:28:24',)
,('22:20:54',)
,('12:59:38',)
,('21:46:07',)
,('16:00:00',)]
df = spark.createDataFrame(l1,['time'])
# Convert the string column to a timestamp (the current date is added automatically)
df = df.withColumn('time', df.time.cast(T.TimestampType()))
# Apply the filter, prefixing both bounds with the current date
df = df.filter(F.col("time").between(now + ' 16:00:00', now + ' 23:59:59'))
df.show()
Output:
+-------------------+
| time|
+-------------------+
|2019-03-12 22:20:54|
|2019-03-12 21:46:07|
|2019-03-12 16:00:00|
+-------------------+
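If you need the time-only strings back after filtering, one option (not part of the original answer) is to reformat the timestamp with date_format:

from pyspark.sql import functions as F
# format the timestamp back to an HH:mm:ss string
df = df.withColumn('time', F.date_format('time', 'HH:mm:ss'))
df.show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#|16:00:00|
#+--------+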