I'm trying to transpose some columns of my table to rows. I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
|    A|col_1|col_2|col_...|
+-----+-----+-----+-------+
|    1|  0.0|  0.6|    ...|
|    2|  0.6|  0.7|    ...|
|    3|  0.5|  0.9|    ...|
|  ...|  ...|  ...|    ...|
+-----+-----+-----+-------+
I would like to have something like this:
+-----+--------+-----------+
|    A|  col_id|  col_value|
+-----+--------+-----------+
|    1|   col_1|        0.0|
|    1|   col_2|        0.6|
|  ...|     ...|        ...|
|    2|   col_1|        0.6|
|    2|   col_2|        0.7|
|  ...|     ...|        ...|
|    3|   col_1|        0.5|
|    3|   col_2|        0.9|
|  ...|     ...|        ...|
+-----+--------+-----------+
Does someone know how I can do it? Thank you for your help.
Use flatMap. Something like the below should work.
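A rough sketch of the flatMap idea (not the original poster's code), assuming the DataFrame is named `df`, its key column is `A` as in the question, and a `SQLContext` named `sqlContext` is in scope:

```python
# Emit one (A, col_id, col_value) tuple per value column of every row.
value_cols = [c for c in df.columns if c != "A"]

long_rows = df.rdd.flatMap(
    lambda row: [(row["A"], c, row[c]) for c in value_cols]
)

long_df = sqlContext.createDataFrame(long_rows, ["A", "col_id", "col_value"])
```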
One handy way to implement this is with pyspark sql, using the functions `create_map` and `explode`.
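A minimal sketch of that approach, assuming the DataFrame is named `df` with key column `A` as in the question; note that `create_map` only exists in newer Spark releases (around 2.0+), not the 1.5.0 mentioned in the question:

```python
from pyspark.sql import functions as F

value_cols = [c for c in df.columns if c != "A"]

# Build a per-row map of {column_name: column_value} ...
kv_pairs = [item for c in value_cols for item in (F.lit(c), F.col(c))]

# ... then explode the map into one (key, value) row per original column.
result = df.select(
    "A",
    F.explode(F.create_map(*kv_pairs)).alias("col_id", "col_value")
)
```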
It is relatively simple to do with basic Spark SQL functions.
Python:
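A minimal sketch of the idea in Python, assuming a DataFrame `df` whose value columns share a compatible type (which `array` requires); the helper name `to_long` is just illustrative:

```python
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Pack each non-key column into a (col_id, col_value) struct,
    # collect the structs into an array, and explode into one row per column.
    cols = [c for c in df.columns if c not in by]
    kvs = explode(array(
        *[struct(lit(c).alias("col_id"), col(c).alias("col_value")) for c in cols]
    )).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.col_id", "kvs.col_value"])

# to_long(df, ["A"]) yields the A / col_id / col_value layout from the question.
```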
Scala: the same approach can be written with the equivalent Scala functions.
The Spark local linear algebra libraries are presently very weak, and they do not include basic operations such as the one above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written a transpose in Scala but not in Python; you can convert the Scala version to Python for your use. I do not have the bandwidth to write/test that at this particular moment: let me know if you are unable to do that conversion.
At the least, the following are readily converted to Python:

- `zipWithIndex` --> `enumerate()` (Python equivalent; credit to @zero323)
- `map` --> `[someOperation(x) for x in ..]`
- `groupBy` --> `itertools.groupby()`
The Scala version also relies on `flatten`, which does not have a direct Python equivalent, but a nested list comprehension (or `itertools.chain`) does the same job. So you should be able to put those together for a solution.
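As a rough illustration of how those pieces might fit together in Python (an untested sketch, not a conversion of the actual Scala code), assume an RDD whose elements are plain lists of values, one list per row:

```python
def transpose_rdd(rdd):
    # Tag every cell with (column_index, (row_index, value)).
    cells = rdd.zipWithIndex().flatMap(
        lambda row_idx: [(j, (row_idx[1], v)) for j, v in enumerate(row_idx[0])]
    )
    # Group the cells by column index, then rebuild each former column
    # as a row, ordered by the original row index.
    return (cells
            .groupByKey()
            .sortByKey()
            .map(lambda kv: [v for _, v in sorted(kv[1])]))

# Example: transpose_rdd(sc.parallelize([[1, 2, 3], [4, 5, 6]])).collect()
# -> [[1, 4], [2, 5], [3, 6]]
```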
I took the Scala answer that @javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what the OP was asking... For example:
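What follows is only a sketch of how such a full transpose could look (not that answer's original code), assuming the data is small enough to collect to the driver; the `sqlContext` handle and the pivot column name `A` are assumptions:

```python
def transpose_df(df, sqlContext, pivot_col="A"):
    other_cols = [c for c in df.columns if c != pivot_col]
    rows = df.collect()
    # Values of the pivot column become the new column names.
    header = [pivot_col] + [str(row[pivot_col]) for row in rows]
    # Each remaining original column becomes one row of the result.
    data = [[c] + [row[c] for row in rows] for c in other_cols]
    return sqlContext.createDataFrame(data, header)
```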