How to join two dataframes in Scala and select on

2019-09-14 18:27发布

问题:

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one

However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.

I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!

EDIT:

ANSWER

I figured out the solution. Here is one way to do this:

def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1)  // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
      .select($"df1.*",$"df2.$desiredDf2Col")
  df3
}

And then I can apply this function in a reduce operation on a list of dataframes.

var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})

回答1:

To select the second column in your dataframe you can simply do:

val df3 = df2.select(df2.columns(1))

This will first find the second column name and then select it.



回答2:

If the join and select methods that you want to define in reduce function is similar to Joining two DataFrames in Spark SQL and selecting columns of only one
Then you should do the following :

import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)

You will have to remember that the name of the second column i.e. Seq(1) should not be same as any of the dataframes column names.
You can select multiple columns as well but remember the bold note above

import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)