I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2
. In my task, I am going to use the join function for two dataframes within a reduce
function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2
.
I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
val df2cols = df2.columns
val desiredDf2Col = df2cols(1) // the second column
val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
.select($"df1.*",$"df2.$desiredDf2Col")
df3
}
And then I can apply this function in a reduce
operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})