For a set of dataframes
val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")
to union all of them I do
df1.unionAll(df2).unionAll(df3)
Is there a more elegant and scalable way of doing this for any number of dataframes, for example from
Seq(df1, df2, df3)
For pyspark you can do the following:
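A minimal sketch of that approach, folding the DataFrames pairwise with functools.reduce; the df names are placeholders, and DataFrame.union assumes Spark >= 2.0 (use DataFrame.unionAll on older versions):

from functools import reduce
from pyspark.sql import DataFrame

def union_all(*dfs):
    # Left-fold the DataFrames with the position-based pairwise union
    return reduce(DataFrame.union, dfs)

result = union_all(df1, df2, df3)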
It's also worth noting that the order of the columns in the dataframes must be the same for this to work, since union matches columns by position. It can silently give unexpected results if the column order differs!
If you are using pyspark 2.3 or greater, you can use unionByName so you don't have to reorder the columns.
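For example, a one-line sketch reusing the reduce pattern above, with columns matched by name instead of position (the df names are placeholders):

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.unionByName, [df1, df2, df3])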
The simplest solution is to reduce with union (unionAll in Spark < 2.0):
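A minimal sketch in Scala, assuming the DataFrames from the question and Spark >= 2.0:

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)  // pairwise unions, folded left to right; use _ unionAll _ in Spark < 2.0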
This is relatively concise and shouldn't move data from off-heap storage, but it extends the lineage with each union and requires non-linear time to perform plan analysis, which can be a problem if you try to merge a large number of DataFrames.
You can also convert to RDDs and use SparkContext.union:
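A sketch of this approach, assuming spark is the active SparkSession and every DataFrame in dfs has the same schema (on Spark < 2.0, the equivalent call on sqlContext works the same way):

val dfs = Seq(df1, df2, df3)
val unioned = spark.createDataFrame(
  spark.sparkContext.union(dfs.map(_.rdd)),  // one n-ary RDD union, so the lineage stays flat
  dfs.head.schema
)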
It keeps the lineage short and the analysis cost low, but otherwise it is less efficient than merging DataFrames directly.