As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the DataFrame snapshot below:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]
The result above was created by joining a DataFrame with itself; you can see there are four columns: two named a and two named f.
The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'), and both returned the error message below:
AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.
Is there any way in the Spark API to distinguish the columns with duplicated names again? Or maybe some way to let me change the column names?
I would recommend that you change the column names before your join. The resulting DataFrame will then have an unambiguous schema.
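A minimal sketch of that renaming, assuming two DataFrames df1 and df2 with the question's columns a and f (the df1_/df2_ prefixes are illustrative):

    from pyspark.sql.functions import col

    # Prefix each column with its source DataFrame so names no longer collide
    df1_renamed = df1.select(col("a").alias("df1_a"), col("f").alias("df1_f"))
    df2_renamed = df2.select(col("a").alias("df2_a"), col("f").alias("df2_f"))

    joined = df1_renamed.join(df2_renamed, col("df1_a") == col("df2_a"))
    joined.printSchema()  # df1_a, df1_f, df2_a, df2_f

Every column of the result can now be selected by its unique name.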
After digging into the Spark API, I found I can first use alias to create an alias for the original DataFrame, then use withColumnRenamed to manually rename every column on the alias, and finally do the join without causing the column name duplication. More details can be found in the Spark DataFrame API docs below (a sketch follows the links):
pyspark.sql.DataFrame.alias
pyspark.sql.DataFrame.withColumnRenamed
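A minimal sketch of this workaround, assuming the original DataFrame is df with columns a and f; the _2 suffix is illustrative:

    # Alias the original DataFrame, then rename every column on the copy
    df_alias = df.alias("df_2")
    for c in df_alias.columns:
        df_alias = df_alias.withColumnRenamed(c, c + "_2")

    # The self-join now produces no duplicated column names
    joined = df.join(df_alias, df["a"] == df_alias["a_2"])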
However, I think this is only a troublesome workaround, and I am wondering if there is a better way to do this.
This is how we can join two DataFrames on the same column names in PySpark:
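A sketch under the question's setup, where both DataFrames share the key column a; passing the key as a list of names keeps a single copy of it:

    # Join on the shared column name(s); Spark deduplicates the join keys
    df = df1.join(df2, ['a'], 'inner')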
If you do printSchema() after this, you can see that the duplicate columns have been removed.
There is a simpler way than writing aliases for all of the columns you are joining on:
This works if the key that you are joining on is the same in both tables.
See https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
Let's start with some data:
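A sketch reproducing the question's columns a and f; the SparkSession setup and the sample rows are assumptions based on the snapshot above:

    from pyspark.ml.linalg import SparseVector
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([
        Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
        Row(a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
    ])

    df2 = spark.createDataFrame([
        Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
        Row(a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
    ])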
There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using the parent DataFrames:
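For example, joining on a and pulling f from the first DataFrame:

    # Disambiguate by going through the parent DataFrame objects
    df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)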
You can also use table aliases:
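A sketch with illustrative alias names df1_a and df2_a:

    from pyspark.sql.functions import col

    df1_a = df1.alias("df1_a")
    df2_a = df2.alias("df2_a")

    # Qualify each column reference with its table alias
    df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)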
Finally, you can programmatically rename columns:
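For instance, suffixing every column with the name of its source DataFrame (the _df1/_df2 suffixes are illustrative):

    from pyspark.sql.functions import col

    df1_r = df1.select(*(col(c).alias(c + '_df1') for c in df1.columns))
    df2_r = df2.select(*(col(c).alias(c + '_df2') for c in df2.columns))

    df1_r.join(df2_r, col('a_df1') == col('a_df2')).select('f_df1').show(2)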
Suppose the DataFrames you want to join are df1 and df2 and you are joining them on column 'a'; then you have two methods:
Method 1
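A sketch, assuming the question's shared key column a; joining on the column name rather than an expression keeps a single copy of the key:

    # Method 1: join on the column name, so 'a' appears only once in the result
    joined = df1.join(df2, 'a')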
This is an awesome method and it is highly recommended.
Method 2
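A sketch of the alternative: rename the key on one side first (the name a2 is illustrative), then join on an explicit condition:

    # Method 2: rename the key in df2, then join on an expression
    df2_renamed = df2.withColumnRenamed('a', 'a2')
    joined = df1.join(df2_renamed, df1['a'] == df2_renamed['a2'])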