I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
Try this: you can simply pass the join columns as a Seq of names together with a join-type string, sketched below with the question's ts and id columns.
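```scala
// TYPE-OF-JOIN is a placeholder for the join-type string, e.g. "inner"
df1.join(df2, Seq("ts", "id"), "TYPE-OF-JOIN")
```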
Here TYPE-OF-JOIN can be, for example: inner, cross, outer / full / fullouter, left / leftouter, right / rightouter, leftsemi, or leftanti (the exact set of accepted strings depends on your Spark version).
For example, suppose I have two small dataframes keyed on ts and id, and I do a fullouter join.
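A minimal sketch, with toy rows chosen only for illustration; the result keeps one copy of ts and id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy frames shaped like the question's
val df1 = Seq((1L, "a", 10), (2L, "b", 20)).toDF("ts", "id", "X1")
val df2 = Seq((1L, "a", 0.5), (3L, "c", 0.7)).toDF("ts", "id", "Y1")

df1.join(df2, Seq("ts", "id"), "fullouter").show()
// ts and id appear once; unmatched rows get nulls from the other side:
// (1, a, 10, 0.5), (2, b, 20, null), (3, c, null, 0.7)
```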
Inner join is the default join in Spark; below is the simple syntax for it.
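```scala
df1.join(df2, Seq("ts", "id"))  // inner join; each join column appears once
```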
For other join types you can follow the syntax below:
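```scala
df1.join(df2, Seq("ts", "id"), "leftouter")  // or "rightouter", "fullouter", "leftsemi", ...
```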
If the column names are not common between the two dataframes, then you have to join with an explicit predicate.
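For example (the right-hand column names ts2 and id2 here are hypothetical):

```scala
// With an explicit predicate, both sides' key columns are kept in the result
df1.join(df2, df1("ts") === df2("ts2") && df1("id") === df2("id2"), "inner")
```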
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
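(Illustrative rows; a SparkSession in scope as spark is assumed.)

```scala
import spark.implicits._

val left = Seq(
  ("bob",   "b", "2015-01-13", 4),
  ("alice", "a", "2015-04-23", 10)
).toDF("firstname", "lastname", "date", "duration")
```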
Here is the right dataframe:
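```scala
// Illustrative rows matching the left frame's keys
val right = Seq(
  ("alice", "a", 100),
  ("bob",   "b", 23)
).toDF("firstname", "lastname", "upload")
```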
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname") === right("firstname") && left("lastname") === right("lastname"). The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
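```scala
// Anti-pattern: the predicate join keeps both copies of the key columns
val wrong = left.join(
  right,
  left("firstname") === right("firstname") && left("lastname") === right("lastname")
)
wrong.columns
// Array(firstname, lastname, date, duration, firstname, lastname, upload)
```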
Seq("firstname", "lastname")
. The output data frame does not have duplicated columns:This is an expected behavior.
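```scala
// Passing the join columns as names keeps a single copy of each
val correct = left.join(right, Seq("firstname", "lastname"))
correct.columns
// Array(firstname, lastname, date, duration, upload)
```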
This is an expected behavior. The DataFrame.join method is equivalent to a SQL join like SELECT * FROM a JOIN b ON joinExprs, so it keeps all columns from both sides. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these columns via the parent DataFrames.
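A sketch using the question's frames (the join predicate is just an example):

```scala
val joined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id"))

// select through the parent frames to disambiguate...
joined.select(df1("ts"), df1("id"), df1("X1"), df2("Y1"))

// ...or drop one side's copies (drop(Column) requires Spark 2.0+)
joined.drop(df2("ts")).drop(df2("id"))
```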
or use aliases:
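```scala
// Qualified names via aliases; the $ syntax needs import spark.implicits._
df1.alias("a")
  .join(df2.alias("b"), $"a.ts" === $"b.ts" && $"a.id" === $"b.id")
  .select($"a.ts", $"a.id", $"a.X1", $"b.Y1")
```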
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
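```scala
df1.join(df2, Seq("ts", "id"))
```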
or a single string:
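```scala
df1.join(df2, "id")
```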
which keeps only one copy of the columns used in the join condition.
This is normal behavior coming from SQL. What I do for this: rename or drop the conflicting source columns before the join, do the join, and then drop the renamed columns if needed.
Here I am replacing the "fullname" column.
Some code in Java:
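A sketch of the rename-join-drop pattern; the table names (persons, updates) and the key column id are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReplaceColumn {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("replace-column").getOrCreate();

    // Rename the clashing column on the original side...
    Dataset<Row> persons = spark.table("persons")
        .withColumnRenamed("fullname", "fullname_old");

    // ...load the replacement values (the query is shown below)...
    Dataset<Row> updates = spark.sql(
        "SELECT id, CONCAT(firstname, ' ', lastname) AS fullname FROM updates");

    // ...join on the shared key (kept once) and drop the renamed column.
    Dataset<Row> result = persons
        .join(updates, "id")
        .drop("fullname_old");

    result.show();
  }
}
```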
Where the query is:
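```sql
-- Hypothetical: builds the replacement fullname values
SELECT id, CONCAT(firstname, ' ', lastname) AS fullname
FROM updates
```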
This is something you can do only with Spark, I believe (dropping a column from the join result), and it is very, very helpful!