Create a new column based on date checking

2020-04-21 00:34发布

问题:

I have two dataframes in Scala:

df1 =

ID  Field1
1   AAA
2   BBB
4   CCC

and

df2 =

PK  start_date_time
1   2016-10-11 11:55:23
2   2016-10-12 12:25:00
3   2016-10-12 16:20:00

I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.

I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.

The result should be this one:

df1 =

ID  Field1  check
1   AAA     1
2   BBB     0
4   CCC     0

In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case it won't work. My initial idea was to use udf, but not sure how to make it working for this case.

回答1:

You can combine join and withColumn for this case. i.e. firstly join with df2 on ID column and then use when.otherwise syntax to modify the check column:

import org.apache.spark.sql.functions.lit

val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show

+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+

Or another option, firstly filter on df2, and then join it back with df1 on ID column:

val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
                    filter($"date" === "2016-10-11").
                    withColumn("check", lit(1)).
                    select($"PK".as("ID"), $"date", $"check"))

df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show

+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+

In case you have a date like 2016-OCT-11, you can convert it sql Date for comparison as follows:

val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11