I found from JIRA that 1.6 release of SparkR
has implemented window functions including lag
and rank
, but over
function is not implemented yet. How can I use window function like lag
function without over
in SparkR
(not the SparkSQL
way)? Can someone provide an example?
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
Spark 2.0.0+
SparkR provides DSL wrappers with over
, window.partitionBy
/ partitionBy
, window.orderBy
/ orderBy
and rowsBetween
/ rangeBeteen
functions.
Spark <= 1.6
Unfortunately it is not possible in 1.6.0. While some window functions, including lag
, have been implemented SparkR doesn't support window definitions yet which renders these completely useless.
As long as SPARK-11395 is not resolved the only option is to use raw SQL:
set.seed(1)
hc <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x=1:12, y=1:3, z=rnorm(12)))
registerTempTable(sdf, "sdf")
sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>%
head()
## x y z _c3
## 1 1 1 -0.6264538 NA
## 2 4 1 1.5952808 -0.6264538
## 3 7 1 0.4874291 1.5952808
## 4 10 1 -0.3053884 0.4874291
## 5 2 2 0.1836433 NA
## 6 5 2 0.3295078 0.1836433
Assuming that the corresponding PR will be merged without significant changes window definition and example query should look as follows:
w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))