Trying to figure out how to use window functions in PySpark. Here's an example of what I'd like to be able to do: simply count the number of times a user has an "event" (in this case "dt" is a simulated timestamp).
from pyspark.sql.window import Window
from pyspark.sql.functions import count
df = sqlContext.createDataFrame([{"id": 123, "dt": 0}, {"id": 123, "dt": 1}, {"id": 234, "dt": 0}, {"id": 456, "dt": 0}, {"id": 456, "dt": 1}, {"id": 456, "dt": 2}])
df.select(["id", "dt"], count("dt").over(Window.partitionBy("id").orderBy("dt")).alias("count")).show()
This produces an error. What is the correct way to use window functions? I read that Spark 1.4.1 (the version we need to use, since it's what's standard on AWS) should be able to do them with the DataFrame API.
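For reference, here's a sketch of what I'd guess the corrected call should look like, assuming select() wants each column passed as a separate argument rather than mixed with a list (and I've seen hints that window functions may need a HiveContext rather than a plain SQLContext on 1.4.1), but I haven't been able to verify it:

from pyspark.sql.window import Window
from pyspark.sql.functions import count

# One partition per user, ordered by the simulated timestamp
w = Window.partitionBy("id").orderBy("dt")

# Pass each column (and the windowed count) as a separate argument to select()
df.select("id", "dt", count("dt").over(w).alias("count")).show()

If that's right, I'd expect a cumulative count per id ordered by dt, but I haven't managed to get it to run.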
FWIW, the documentation is pretty sparse on this subject, and I had trouble getting any of the examples to actually run.