This question is related to the previous one. I have two dataframes in Scala:
df1 =

ID  start_date_time      field1  field2
1   2016-10-12 11:55:23  AAA     xxx1
2   2016-10-12 12:25:00  BBB     xxx2
3   2016-10-12 16:20:00  CCC     xxx3
and
df2 =

PK  start_date
1   2016-10-12
2   2016-10-14
I need to add a new column to df1 that is 1 if the following condition holds, and 0 otherwise:
ID == PK and start_date_time falls on the same year, month, and day as start_date.
The result should be:

df1 =

ID  start_date_time      check  field1  field2
1   2016-10-12-11-55-23  1      AAA     xxx1
2   2016-10-12-12-25-00  0      BBB     xxx2
3   2016-10-12-16-20-00  0      CCC     xxx3
I use this solution:

import org.apache.spark.sql.functions.{lit, to_date}
import spark.implicits._

val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = df2.withColumn("date", to_date(df2("start_date")))
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check", $"field1", $"field2")

df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
However, is it possible to avoid listing all of df1's column names explicitly in
select($"PK".as("ID"), $"date", $"check", $"field1", $"field2")?
Is it possible to do something like select($"PK".as("ID"), $"date", $"check", "*")?
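
For what it's worth, a sketch of one way around the problem (hypothetical, built on the sample data above): keep only the join keys and the check flag on the df2 side, and let the left join carry all of df1's columns through automatically, so none of them need to be named at all. This assumes a SparkSession named spark:

import org.apache.spark.sql.functions.{lit, to_date}
import spark.implicits._

val df1 = Seq(
  (1, "2016-10-12 11:55:23", "AAA", "xxx1"),
  (2, "2016-10-12 12:25:00", "BBB", "xxx2"),
  (3, "2016-10-12 16:20:00", "CCC", "xxx3")
).toDF("ID", "start_date_time", "field1", "field2")

val df2 = Seq((1, "2016-10-12"), (2, "2016-10-14")).toDF("PK", "start_date")

// df2 side: only the renamed key, the day-truncated date, and the flag.
// df1's columns (field1, field2, ...) are never mentioned here.
val df2_keys = df2
  .select($"PK".as("ID"), to_date($"start_date").as("date"))
  .withColumn("check", lit(1))

val result = df1
  .withColumn("date", to_date($"start_date_time"))  // same-day comparison
  .join(df2_keys, Seq("ID", "date"), "left")        // unmatched rows get check = null
  .drop("date")
  .na.fill(0, Seq("check"))                         // null -> 0, only in "check"

result.show()

With the sample data this would give check = 1 for ID 1 (same key and day) and 0 for IDs 2 and 3, while field1 and field2 pass through the join untouched.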