How to change a column position in a spark datafra

2019-02-04 07:32发布

问题:

I was wondering if it is possible to change the position of a column in a dataframe, actually to change the schema ?

Precisely if i have got a dataframe like [field1, field2, field3], and i would like to get [field1, field3, field2].

Any help would be much appreciated !

Thanks.

Edit :

I can't put any piece of code. Let us imagine we're working with a dataframe with one hundred columns, after some joins and transformations, some of these columns are misplaced regarding the schema of the destination table. So my point is : how to move one or several columns, i.e : how to change the schema ?

Thanks.

回答1:

You can get the column names, reorder them however you want, and then use select on the original DataFrame to get a new one with this new order:

val columns: Array[String] = dataFrame.columns
val reorderedColumnNames: Array[String] = ??? // do the reordering you want
val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)


回答2:

A tiny different version compare to @Tzach Zohar

val cols = df.columns.map(df(_)).reverse
val reversedColDF = df.select(cols:_*)


回答3:

The spark-daria library has a reorderColumns method that makes it easy to reorder the columns in a DataFrame.

import com.github.mrpowers.spark.daria.sql.DataFrameExt._

val actualDF = sourceDF.reorderColumns(
  Seq("field1", "field3", "field2")
)

The reorderColumns method uses @Rockie Yang's solution under the hood.

If you want to get the column ordering of df1 to equal the column ordering of df2, something like this should work better than hardcoding all the columns:

df1.reorderColumns(df2.columns)

The spark-daria library also defines a sortColumns transformation to sort columns in ascending or descending order (if you don't want to specify all the column in a sequence).

import com.github.mrpowers.spark.daria.sql.transformations._

df.transform(sortColumns("asc"))


回答4:

Like others have commented, I'm curious to know why would you do this as the order is not relevant when you can query the columns by their names.

Anyway, using a select should give the feeling the columns have moved in schema description:

val data = Seq(
  ("a",       "hello", 1),
  ("b",       "spark", 2)
)
.toDF("field1", "field2", "field3")

data
 .show()

data
 .select("field3", "field2", "field1")
 .show()