I have two sample DataFrames, df_a and df_b.
df_a
+----+------+-----------+---------+
| id | name | mobile1   | address |
+----+------+-----------+---------+
| 1  | Matt | 123456798 |         |
+----+------+-----------+---------+
| 2  | John | 123456798 |         |
+----+------+-----------+---------+
| 3  | Lena |           |         |
+----+------+-----------+---------+
df_b
+----+------+-----------+---------+---------+
| id | name | mobile2   | city    | country |
+----+------+-----------+---------+---------+
| 3  | Lena | 123456798 | Detroit | USA     |
+----+------+-----------+---------+---------+
and I'm trying to select certain columns from both, as follows:
import org.apache.spark.sql.functions.{coalesce, lit}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  df_a("name"), df_a("id"), df_a("address"),
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
I want to do a similar operation on two actual DataFrames, where df_a has a huge number of columns. I want to select all the columns of df_a in a particular order, plus two columns from df_b. So I tried the following
val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )
and
val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )
But apparently I'm providing the wrong type of arguments to select and selectExpr.
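From the API docs, select seems to be overloaded as select(cols: Column*) and select(col: String, cols: String*), and selectExpr takes String*, so a single comma-joined String passed alongside a Column matches none of these signatures. What I imagine is needed is a Seq[Column] splatted as varargs, built from df_a.columns so I don't have to list every name by hand. This is only my guess, not verified on 1.5.0:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, lit}

// Guess: build the projection as a Seq[Column], then splat it with : _*
// df_a.columns should return all of df_a's column names in schema order
// (which may or may not be the particular order I need).
val fromA: Seq[Column] = df_a.columns.map(df_a(_))
val projection = fromA :+ coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(projection: _*)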
Could someone please help me with this? I'm using Spark 1.5.0.
Update
I tried the following
val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(DFA.mobile1, DFB.mobile2, 0)"
  )
and got an error
org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;
Then I tried
val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )
And got
org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;
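Reading these two errors, it seems select is treating the whole comma-joined string as a single column name (it lists the actual input columns individually). Splitting the string into separate arguments should presumably fix that, e.g. this untested variant with the qualified names (name and id exist in both frames, so I imagine the unqualified ones would be ambiguous anyway):

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
// One column name per argument, using the head/tail overload of select
val plainCols = df_a_cols.split(",")  // Array("DFA.name", "DFA.id", "DFA.address")
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(plainCols.head, plainCols.tail: _*)

But select with strings only takes bare column names, not expressions, so the coalesce still has nowhere to go; that's what pushed me to selectExpr.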
With
val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )
I got
java.lang.RuntimeException: [1.10] failure: identifier expected
name,id,address,coalesce(mobile1, mobile2, 0)
         ^
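So selectExpr apparently parses each argument as one SQL expression, and the comma-joined string trips the parser partway through. My current guess is that each expression needs to be its own vararg, and (since name and id exist in both frames) the names probably need their DFA qualifier back. Again untested:

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
// Guess: one SQL expression per argument; coalesce(...) should be fine here
// because selectExpr parses expressions, not just column names.
val exprs = df_a_cols.split(",") :+ "coalesce(DFA.mobile1, DFB.mobile2, 0)"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(exprs: _*)

Is that the right way to combine all of df_a's columns with a couple of expressions over df_b, or is there a cleaner approach?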