Trying to figure out how to use window functions in PySpark. Here's an example of what I'd like to be able to do: simply count the number of times a user has an "event" (in this case "dt" is a simulated timestamp).
from pyspark.sql.window import Window
from pyspark.sql.functions import count
df = sqlContext.createDataFrame([{"id": 123, "dt": 0}, {"id": 123, "dt": 1}, {"id": 234, "dt": 0}, {"id": 456, "dt": 0}, {"id": 456, "dt": 1}, {"id": 456, "dt": 2}])
df.select(["id", "dt"], count("dt").over(Window.partitionBy("id").orderBy("dt")).alias("count")).show()
This produces an error. What is the correct way to use window functions? I read that Spark 1.4.1 (the version we need to use, since it's what's standard on AWS) should be able to do them with the DataFrame API.
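For reference, here's a sketch of what I'd guess the corrected call should look like, assuming select() wants each column passed as a separate argument rather than mixed with a list (and I've seen hints that window functions may need a HiveContext rather than a plain SQLContext on 1.4.1), but I haven't been able to verify it:

from pyspark.sql.window import Window
from pyspark.sql.functions import count

# One partition per user, ordered by the simulated timestamp
w = Window.partitionBy("id").orderBy("dt")

# Pass each column (and the windowed count) as a separate argument to select()
df.select("id", "dt", count("dt").over(w).alias("count")).show()

If that's right, I'd expect a cumulative count per id ordered by dt, but I haven't managed to get it to run.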
FWIW, the documentation is pretty sparse on this subject, and I had trouble getting any of the examples to actually run.