first_value windowing function in pyspark

2019-02-19 10:28发布

站内文章 / Spark

37 0

迷人小祖宗

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am using pyspark 1.5 getting my data from Hive tables and trying to use windowing functions.

According to this there exists an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive but I can not find this in pyspark anywhere.

Is there a way to implement this given that pyspark won't allow UserDefinedAggregateFunctions (UDAFs)?

回答1:

Spark >= 2.0:

first takes an optional ignorenulls argument which can mimic the behavior of first_value:

df.select(col("k"), first("v", True).over(w).alias("fv"))

Spark < 2.0:

Available function is called first and can be used as follows:

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window().partitionBy("k").orderBy("v")

df.select(col("k"), first("v").over(w).alias("fv"))

but if you want to ignore nulls you'll have to use Hive UDFs directly:

df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")

标签： apache-spark pyspark apache-spark-sql window-functions

迷人小祖宗

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~

first_value windowing function in pyspark

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮