Description
Given a DataFrame df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
- grouped by the same id and
- sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with a window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select(df("id"), df("date"), rowNumber().over(w).alias("counter"))
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
- How can I achieve the above computation on current Spark 1.4.1 without using window functions?
- When will window functions for regular dataframes be supported in Spark?
Thanks!
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
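A minimal sketch of that RDD approach (assuming column 0 holds an Int id and column 1 a String date, as in the question):

```scala
import org.apache.spark.sql.Row

// Group the rows by id, then sort each group's rows by date.
val grouped = df.rdd
  .groupBy((row: Row) => row.getInt(0))          // RDD[(Int, Iterable[Row])]
  .mapValues(_.toList.sortBy(_.getString(1)))    // RDD[(Int, List[Row])]
```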
This gives an RDD of (id, rows) pairs, with each group's rows sorted by date.
If you want the position within the 'group' as well, you can use zipWithIndex. You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group' that won't be a great idea. The downside to using RDDs like this is that it's tedious to convert from DataFrame to RDD and back again.
You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in spark-shell and the pyspark shell (as for now sparkR seems to use plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.

I totally agree that window functions for DataFrames are the way to go if you have Spark version >= 1.5. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this.
Now if you do dfWithCounter.show you will get the counter for every (id, date) row.
Note that date is not sorted, but the counter is correct. Also you can change the ordering of the counter by changing the <= to >= in the where statement.