Transpose column to row with Spark

Posted 2019-01-01 03:53

I'm trying to transpose some columns of my table into rows. I'm using Python and Spark 1.5.0. Here is my initial table:

|  A  |col_1|col_2|col_...|
|  1  |  0.0|  0.6|  ...  |
|  2  |  0.6|  0.7|  ...  |
|  3  |  0.5|  0.9|  ...  |
|  ...|  ...|  ...|  ...  |

I would like to have something like this:

|  A  | col_id | col_value |
|  1  |   col_1|        0.0|
|  1  |   col_2|        0.6|   
|  ...|     ...|        ...|    
|  2  |   col_1|        0.6|
|  2  |   col_2|        0.7| 
|  ...|     ...|        ...|  
|  3  |   col_1|        0.5|
|  3  |   col_2|        0.9|
|  ...|     ...|        ...|

Does someone know how I can do it? Thank you for your help.

Reply #2 · 2019-01-01 04:21

Use flatMap. Something like the following should work:

from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('A')
    for k in rowDict:
        yield Row(**{'A': valA , 'colID': k, 'colValue': row[k]})

newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))
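To see what `flatMap(rowExpander)` produces without a Spark session, here is a minimal sketch of the same expansion on plain Python dicts. The name `expand_row` and the sample `rows` are illustrative stand-ins and not part of the answer above; on a cluster, each `Row` would play the role of the dict.

```python
def expand_row(row, key="A"):
    """Yield one (key, column name, value) record per non-key column."""
    rest = dict(row)            # copy so the caller's dict is not mutated
    key_value = rest.pop(key)
    for name in sorted(rest):   # sorted for a deterministic column order
        yield {key: key_value, "colID": name, "colValue": rest[name]}

# Sample rows taken from the question's table.
rows = [
    {"A": 1, "col_1": 0.0, "col_2": 0.6},
    {"A": 2, "col_1": 0.6, "col_2": 0.7},
]

# flatMap is "map, then concatenate the results";
# a nested comprehension does the same thing locally.
long_rows = [rec for row in rows for rec in expand_row(row)]
for rec in long_rows:
    print(rec)
```

Each input row turns into one output record per value column, which is the long layout the question asks for.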
Reply #3 · 2019-01-01 04:27

A very handy way to implement this:

from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('A')
    for k in rowDict:
        yield Row(**{'A': valA , 'colID' : k, 'colValue' : row[k]})

newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))
Reply #4 · 2019-01-01 04:30

One way to solve this with PySpark SQL is to use the functions create_map and explode.

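from pyspark.sql import functions as func
# Use `create_map` to build a map from constant column names to column values
# (create_map requires Spark 2.0 or later)
df = df.withColumn('mapCol',
                   func.create_map(func.lit('col_1'), df.col_1,
                                   func.lit('col_2'), df.col_2))
# Use the explode function to turn each map entry into its own row
res = df.select('*', func.explode(df.mapCol).alias('col_id', 'col_value'))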
Reply #5 · 2019-01-01 04:31

It is relatively simple to do with basic Spark SQL functions.

Python:

from pyspark.sql.functions import array, col, explode, struct, lit

df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])

def to_long(df, by):

    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

to_long(df, ["A"])

Scala:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")

def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")      

  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
  ))

  val byExprs = by.map(col(_))

  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}

toLong(df, Seq("A"))
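On Spark 2.0 or later, the SQL generator function `stack` can express the same unpivot in a single expression. The sketch below only builds the expression string, so it runs without Spark. `stack_expr` is a hypothetical helper name, and passing its result to `df.selectExpr` is an assumption about how you would use it, not something exercised here. As with `to_long`, the value columns still need a common type, and column names that contain quotes or backticks are not handled.

```python
def stack_expr(value_cols, key_name="col_id", value_name="col_value"):
    """Build a Spark SQL `stack(...)` expression that unpivots value_cols."""
    if not value_cols:
        raise ValueError("need at least one column to unpivot")
    # stack(n, 'name_1', `name_1`, ..., 'name_n', `name_n`)
    # emits n rows of (name, value) for every input row.
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_cols)
    return "stack({n}, {pairs}) as ({k}, {v})".format(
        n=len(value_cols), pairs=pairs, k=key_name, v=value_name)

expr = stack_expr(["col_1", "col_2"])
print(expr)

# Intended use (assumes an existing DataFrame `df` with these columns):
# df.selectExpr("A", expr)
```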
Reply #6 · 2019-01-01 04:31

The Spark local linear algebra libraries are presently very weak, and they do not include basic operations such as the one above.

There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.

Something to consider: performing a transpose will likely require completely shuffling the data.

For now you will need to write RDD code directly. I have written transpose in Scala, but not in Python. Here is the Scala version:

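 def transpose(mat: DMatrix) = {  // DMatrix: assumed alias for Array[Array[Double]]
    val nCols = mat(0).length
    val matT = mat
      .flatten
      .zipWithIndex
      .groupBy { _._2 % nCols }
      .toSeq.sortBy { _._1 }
      .map(_._2)
      .map(_.map(_._1))
      .toArray
    matT
  }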

You can convert that to Python for your own use. I do not have the bandwidth to write and test it right now; let me know if you are unable to do the conversion.

At the least, the following are readily converted to Python:

  • zipWithIndex --> enumerate() (Python equivalent - credit to @zero323; note that enumerate yields (index, value), the reverse of zipWithIndex)
  • map --> [someOperation(x) for x in ..]
  • groupBy --> itertools.groupby()

Here is an implementation of flatten, which does not have a built-in Python equivalent:

  def flatten(L):
      for item in L:
          try:
              for i in flatten(item):
                  yield i
          except TypeError:
              yield item

So you should be able to put those together for a solution.
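Putting those pieces together, a minimal pure-Python sketch of the same transpose on a list of lists might look like the following. It runs locally rather than on an RDD, and the names `transpose`, `n_cols`, and `column_key` are my own choices. One difference from Scala: `itertools.groupby` only groups adjacent items, so the indexed values are sorted by key first.

```python
from itertools import groupby

def flatten(nested):
    """Yield the leaf items of an arbitrarily nested iterable of numbers."""
    for item in nested:
        try:
            for leaf in flatten(item):
                yield leaf
        except TypeError:       # item is not iterable, so it is a leaf
            yield item

def transpose(mat):
    """Transpose a rectangular list of lists of numbers."""
    n_cols = len(mat[0])
    indexed = list(enumerate(flatten(mat)))     # (flat_index, value) pairs

    def column_key(pair):
        return pair[0] % n_cols                 # flat index -> column number

    # sorted() is stable, so values within a column keep their row order.
    by_column = sorted(indexed, key=column_key)
    return [[value for _, value in group]
            for _, group in groupby(by_column, key=column_key)]

print(transpose([[1, 2, 3], [4, 5, 6]]))
```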

Reply #7 · 2019-01-01 04:38

I took the Scala answer that @javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what OP was asking...

from itertools import chain
from pyspark.sql import DataFrame

def _sort_transpose_tuple(tup):
    x, y = tup
    return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]

def transpose(X):
    """Transpose a PySpark DataFrame.

    X : PySpark ``DataFrame``
        The ``DataFrame`` that should be transposed.
    """
    # validate
    if not isinstance(X, DataFrame):
        raise TypeError('X should be a DataFrame, not a %s' 
                        % type(X))

    cols = X.columns
    n_features = len(cols)

    # Sorry for this unreadability...
    return X.rdd.flatMap( # make into an RDD
        lambda xs: chain(xs)).zipWithIndex().groupBy( # zip index
        lambda val_idx: val_idx[1] % n_features).sortBy( # group by index % n_features as key
        lambda grp_res: grp_res[0]).map( # sort by index % n_features key
        lambda grp_res: _sort_transpose_tuple(grp_res)).map( # maintain order
        lambda key_col: key_col[1]).toDF() # return to DF

For example:

>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
| _1| _2| _3|
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|

>>> transpose(X).show()
| _1| _2| _3|
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
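
The helper `_sort_transpose_tuple` is the least obvious step, so here is a small walk-through on plain Python data, with no Spark needed. The sample `group` is made up for illustration: it mimics what `groupBy` would hand over for column 0 of the 3x3 example, as a key followed by `(value, flat_index)` pairs in arbitrary order.

```python
def _sort_transpose_tuple(tup):
    x, y = tup
    # Sort the (value, index) pairs by index, then keep only the values.
    return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]

# Key 0 with the values of column 0, deliberately out of order.
group = (0, [(7, 6), (1, 0), (4, 3)])

key, column = _sort_transpose_tuple(group)
print(key, column)
```

Sorting by the flat index restores the original row order, which a shuffle does not guarantee on its own.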