In standard SQL, when you join a table to itself, you can create aliases for the tables to keep track of which columns you are referring to:
SELECT a.column_name, b.column_name...
FROM table1 a, table1 b
WHERE a.common_field = b.common_field;
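To make the snippets below concrete, assume a local SparkSession and a toy DataFrame with a common_field column; the session setup and sample data here are purely illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("self-join").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: a join key (common_field) plus one payload column.
val df: DataFrame = Seq(
  (1, "a"),
  (2, "b"),
  (2, "c")
).toDF("common_field", "value")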
There are two ways I can think of to achieve the same thing using the Spark DataFrame
API:
Solution #1: Rename the columns
There are a couple of different methods for doing this. This one just renames all the columns with a specific suffix:
df.toDF(df.columns.map(_ + "_R"):_*)
For example, you can do:
df.join(df.toDF(df.columns.map(_ + "_R"):_*), $"common_field" === $"common_field_R")
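Putting Solution #1 together as a minimal sketch against the toy df above (the _R suffix and the val names are arbitrary, and the $"..." column syntax needs import spark.implicits._):

// Right-hand copy with every column renamed, so no name is ambiguous.
val df_renamed = df.toDF(df.columns.map(_ + "_R"): _*)

val joined = df.join(df_renamed, $"common_field" === $"common_field_R")

// Columns from each side now have distinct names and can be selected directly.
joined.select($"common_field", $"value", $"value_R").show()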
Solution #2: Copy the reference to the DataFrame
Another simple solution is to just do this:
val df: DataFrame = ....
val df_right = df
df.join(df_right, df("common_field") === df_right("common_field"))
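Run against the toy df above, a sketch of Solution #2 looks like this (self_joined is just an illustrative name). Because both sides share the same schema, every column name appears twice in the result:

val self_joined = df.join(df_right, df("common_field") === df_right("common_field"))

// The resulting schema is (common_field, value, common_field, value),
// so selecting "value" by name alone would be ambiguous.
self_joined.printSchema()
self_joined.show()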
Both of these solutions work, and I could see each being useful in certain situations. Are there any internal differences between the two that I should be aware of?