I am starting to use Spark Dataframes and I need to be able to pivot the data to create multiple columns out of 1 column with multiple rows. There is built in functionality for that in Scalding and I believe in Pandas in python, but I can't find anything for the new Spark Dataframe.
I assume I can write custom function of some sort that will do this but I'm not even sure how to start, especially since I am a novice with Spark. I anyone knows how to do this with built in functionality or suggestions for how to write something in Scala, it is greatly appreciated.
A pivot operator has been added to the Spark dataframe API, and is part of Spark 1.6.
See https://github.com/apache/spark/pull/7841 for details.
As mentioned by David Anderson Spark provides
pivot
function since version 1.6. General syntax looks as follows:Usage examples using
nycflights13
andcsv
format:Python:
Scala:
Java:
R / SparkR:
R / sparklyr
SQL:
Example data:
Performance considerations:
Generally speaking pivoting is an expensive operation.
if you can try to provide
values
list:in some cases it proved to be beneficial (likely no longer worth the effort in 2.0 or later) to
repartition
and / or pre-aggregate the datafor reshaping only, you can use
first
: How to use pivot and calculate average on a non-numeric column (facing AnalysisException "is not a numeric column")?Related questions:
There is simple and elegant solution.
I have solved a similar problem using dataframes with the following steps:
Create columns for all your countries, with 'value' as the value:
Your dataframe 'dfWithCountries' will look like this:
Now you can sum together all the values for your desired result:
Result:
It's not a very elegant solution though. I had to create a chain of functions to add in all the columns. Also if I have lots of countries, I will expand my temporary data set to a very wide set with lots of zeroes.
Initially i adopted Al M's solution. Later took the same thought and rewrote this function as a transpose function.
This method transposes any df rows to columns of any data-format with using key and value column
for input csv
ouput
transpose method :
main snippet
I overcame this by writing a for loop to dynamically create a SQL query. Say I have:
and I want:
I can create a list with the value I want to pivot and then create a string containing the SQL query I need.
I can create similar query to then do the aggregation. Not a very elegant solution but it works and is flexible for any list of values, which can also be passed in as an argument when your code is called.