Spark/Scala - Select Columns From Multiple Dataframes

Posted 2019-09-11 19:44

Question:

I have two sample dataframes, df_a and df_b:

df_a
+----+------+-----------+-----------+
| id | name | mobile1   | address   |
+----+------+-----------+-----------+
| 1  | Matt | 123456798 |           |
+----+------+-----------+-----------+
| 2  | John | 123456798 |           |
+----+------+-----------+-----------+
| 3  | Lena |           |           |
+----+------+-----------+-----------+

df_b 
+----+------+-----------+-------+---------+
| id | name | mobile2   | city  | country |
+----+------+-----------+-------+---------+
| 3  | Lena | 123456798 |Detroit|  USA    |
+----+------+-----------+-------+---------+

I'm trying to select certain columns from both as follows:

import org.apache.spark.sql.functions.{coalesce, lit}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  df_a("name"), df_a("id"), df_a("address"),
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

I want to do a similar operation on two real dataframes, where df_a has a huge number of columns: select all of df_a's columns in a particular order, plus two columns from df_b. So I tried the following

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols,
 coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

and

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
 df_a_cols,
 coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

But apparently I'm passing the wrong type of arguments to select and selectExpr.

Could someone please help me with this? I'm using Spark 1.5.0.
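
For reference, one common way to pass a dynamic list of columns to select is to build a Seq[Column] and expand it with : _*. A minimal sketch of that approach, assuming the sample dataframes above and Spark 1.5's select(cols: Column*) overload:

import org.apache.spark.sql.functions.{coalesce, lit}

// Build the fixed df_a columns as a Seq[Column].
val dfACols = Seq("name", "id", "address").map(df_a(_))
// The fallback column: mobile1, else mobile2, else 0.
val mobile = coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
// Append the coalesce column and expand the whole Seq as varargs.
val result = df_a
  .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(dfACols :+ mobile: _*)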

Update

I tried the following

val df_a_cols : String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols+",coalesce(DFA.mobile1, DFB.mobile2, 0)"
)

and got an error

org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

Then I tried

val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.select(
 df_a_cols+",coalesce(mobile1, mobile2, 0)"
)

And got

org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;
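
Both errors above point to the same cause: the String overload is select(col: String, cols: String*), and each argument is taken as a single column name, so the comma-joined string is looked up as one literal (nonexistent) column and coalesce(...) is never parsed as an expression. A small illustration, where joined is a hypothetical name for the left_outer join result above:

// `joined` stands for the joined dataframe built above.
// Unambiguous columns can be selected by name, one argument each:
joined.select("address", "city", "country")
// But a comma-joined string is treated as ONE column name and fails:
// joined.select("address,city,country")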

With

val df_a_cols : String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
.selectExpr(
 df_a_cols+",coalesce(mobile1, mobile2, 0)"
)

I got

java.lang.RuntimeException: [1.10] failure: identifier expected

name,id,address,coalesce(mobile1, mobile2, 0)
    ^
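
selectExpr, by contrast, parses each argument as a single SQL expression, which is why parsing the comma-joined string fails with "identifier expected". A sketch of splitting the string into separate expressions first, assuming the DFA/DFB aliases from the attempts above (qualified names avoid ambiguity, since id and name exist in both inputs):

// Split the comma-joined column list into one expression per element,
// then append the coalesce expression and expand as varargs.
val exprs = "DFA.name,DFA.id,DFA.address".split(",") :+
  "coalesce(DFA.mobile1, DFB.mobile2, 0)"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(exprs: _*)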