Join two dataframes by id

This question is related to the previous one. I have two dataframes in Scala:

df1 =

ID  start_date_time      field1    field2
1   2016-10-12 11:55:23  AAA       xxx1
2   2016-10-12 12:25:00  BBB       xxx2
3   2016-10-12 16:20:00  CCC       xxx3

and

df2 =

PK  start_date
1   2016-10-12
2   2016-10-14

I need to add a new column to df1 that holds 1 if the following condition is satisfied, and 0 otherwise:

If ID == PK and start_date_time refers to the same year, month and day as start_date.
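In Spark terms, the match I have in mind is roughly this (just a sketch of the condition; matchCondition is a placeholder name):

import org.apache.spark.sql.functions.to_date

// ID must equal PK, and start_date_time must fall on the same calendar day as start_date
val matchCondition =
  df1("ID") === df2("PK") &&
  to_date(df1("start_date_time")) === to_date(df2("start_date"))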

The result should be this one:

df1 =

ID  start_date_time      check  field1   field2
1   2016-10-12 11:55:23  1      AAA      xxx1
2   2016-10-12 12:25:00  0      BBB      xxx2
3   2016-10-12 16:20:00  0      CCC      xxx3

I use this solution:

import org.apache.spark.sql.functions.{lit, to_date}
import spark.implicits._ // for the $"..." column syntax

// truncate the timestamp to a date so both frames can be joined on the calendar day
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
                        withColumn("check", lit(1)).
                        select($"PK".as("ID"), $"date", $"check", $"field1", $"field2"))

// left join on ID and date, then fill the missing check values with 0
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show

However, is it possible to avoid explicitly listing all of the column names from df1 in select($"PK".as("ID"), $"date", $"check", $"field1", $"field2")? Is it possible to do something like select($"PK".as("ID"), $"date", $"check", *) instead?
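For reference, this is roughly the shape I am after (an untested sketch; df2_flag and result are placeholder names, and spark is the usual SparkSession): put the condition directly in the join and derive check from whether a match was found, so none of df1's columns need to be listed.

import org.apache.spark.sql.functions.{coalesce, lit, to_date}
import spark.implicits._

// keep only the join keys from df2 plus a constant flag
val df2_flag = df2.
  withColumn("date", to_date($"start_date")).
  select($"PK", $"date", lit(1).as("check"))

// left join on ID == PK and on the calendar day; unmatched rows get check = 0
val result = df1.
  join(df2_flag,
       df1("ID") === df2_flag("PK") &&
       to_date(df1("start_date_time")) === df2_flag("date"),
       "left").
  withColumn("check", coalesce($"check", lit(0))).
  drop("PK", "date")

result.show()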
