I have a DataFrame log_df, and I generate a new DataFrame from it with the following code:
from pyspark.sql.functions import split, regexp_extract

split_log_df = log_df.select(
    regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)
The new DataFrame looks like this:
I need another column showing the day of week. What would be the most elegant way to create it? Ideally just by adding a UDF-like field in the select.
Thank you very much.
Updated: my question is different from the one in the comment. What I need is to compute the value from a string in log_df, not from a timestamp column as in that question, so this is not a duplicate. Thanks.
I suggest a slightly different method:
from pyspark.sql.functions import date_format

df3 = df.select('capturetime',
                date_format('capturetime', 'u').alias('dow_number'),
                date_format('capturetime', 'E').alias('dow_string'))
df3.show()
It gives ...
+--------------------+----------+----------+
|         capturetime|dow_number|dow_string|
+--------------------+----------+----------+
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
+--------------------+----------+----------+
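Note that this assumes capturetime is already a timestamp (or timestamp-compatible) column. If you only have a string, as in the question, you would first parse it, for example (a sketch, assuming Spark 2.2+ for to_timestamp and a 'yyyy-MM-dd HH:mm:ss' string):

from pyspark.sql.functions import to_timestamp

# parse the string column first (the format pattern is an assumption about the data)
df = df.withColumn('capturetime', to_timestamp('capturetime', 'yyyy-MM-dd HH:mm:ss'))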
I finally resolved the question myself; here is the complete solution (sketched in code after the list):
- import date_format, datetime, DateType
- first, modify the regexp to extract 01/Jul/1995
- convert 01/Jul/1995 to a DateType using a function
- create a UDF dayOfWeek to get the weekday in brief format (Mon, Tue, ...)
- use the UDF to convert the DateType 01/Jul/1995 to the weekday, which is Sat
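A minimal sketch of those steps, assuming the timestamp regexp is changed to capture only the 01/Jul/1995 part (the column and UDF names here are illustrative):

from datetime import datetime
from pyspark.sql.functions import regexp_extract, udf
from pyspark.sql.types import DateType, StringType

# parse the extracted '01/Jul/1995' string into a DateType column
parse_date = udf(lambda s: datetime.strptime(s, '%d/%b/%Y').date() if s else None, DateType())
# map the date to its abbreviated weekday name (Mon, Tue, ...)
day_of_week = udf(lambda d: d.strftime('%a') if d else None, StringType())

log_with_weekday = (log_df
    .withColumn('date_str', regexp_extract('value', r'\[(\d{2}/\w{3}/\d{4})', 1))
    .withColumn('date', parse_date('date_str'))
    .withColumn('weekday', day_of_week('date')))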
I am not satisfied with my solution, as it seems so zig-zag; I would appreciate it if anyone could come up with something more elegant. Thank you in advance.
I did this to get weekdays from the date:

def get_weekday(date):
    import datetime
    import calendar
    # expects a date string in 'month/day/year' format
    month, day, year = (int(x) for x in date.split('/'))
    weekday = datetime.date(year, month, day)
    return calendar.day_name[weekday.weekday()]

spark.udf.register('get_weekday', get_weekday)
Example of usage:
df.createOrReplaceTempView("weekdays")
df = spark.sql("select DateTime, PlayersCount, get_weekday(Date) as Weekday from weekdays")
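If you prefer the DataFrame API over SQL, the same Python function can be wrapped with pyspark.sql.functions.udf (a sketch; the Date column name is assumed from the SQL above):

from pyspark.sql.functions import udf, col

# wrap the same Python function for use in DataFrame expressions (string return type by default)
get_weekday_udf = udf(get_weekday)
df_with_weekday = df.withColumn('Weekday', get_weekday_udf(col('Date')))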
Since Spark 1.5.0 there has been a date_format function that accepts a format as an argument. With the 'EEEE' pattern it returns the name of the week day from a timestamp:
select date_format(my_timestamp, 'EEEE') from ....
Result: e.g. 'Tuesday'
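The same pattern also works from the DataFrame API (a small sketch; my_timestamp is a placeholder column name):

from pyspark.sql.functions import date_format

df.select(date_format('my_timestamp', 'EEEE').alias('weekday_name')).show()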
this worked for me:
recreate data similar to your example:
df = spark.createDataFrame(\
[(1, "2017-11-01 22:05:01 -0400")\
,(2, "2017-11-02 03:15:16 -0500")\
,(3, "2017-11-03 19:32:24 -0600")\
,(4, "2017-11-04 07:47:44 -0700")\
], ("id", "date"))
df.toPandas()
id date
0 1 2017-11-01 22:05:01 -0400
1 2 2017-11-02 03:15:16 -0500
2 3 2017-11-03 19:32:24 -0600
3 4 2017-11-04 07:47:44 -0700
create a lambda function (wrapped in a UDF) to handle the conversion to a weekday number:

from pyspark.sql.functions import udf
from datetime import datetime

funcWeekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
- extract the date into a shortdate column
- create the weekday column, using the lambda function
- drop the shortdate column
the code:
from pyspark.sql.functions import udf,col
from datetime import datetime
df = df.withColumn('shortdate', col('date').substr(1, 10)) \
       .withColumn('weekDay', funcWeekDay(col('shortdate'))) \
       .drop('shortdate')
result:
df.toPandas()
id date weekDay
0 1 2017-11-01 22:05:01 -0400 3
1 2 2017-11-02 03:15:16 -0500 4
2 3 2017-11-03 19:32:24 -0600 5
3 4 2017-11-04 07:47:44 -0700 6
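Note that %w numbers the days with Sunday as 0. On Spark 2.3+ you could also skip the Python UDF entirely with the built-in dayofweek function; a sketch reusing the shortdate idea:

from pyspark.sql.functions import col, dayofweek, to_date

# dayofweek returns 1 = Sunday ... 7 = Saturday
df.withColumn('weekDay', dayofweek(to_date(col('date').substr(1, 10)))).show()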