I have a PySpark DataFrame named result, and I want to apply a UDF to create a new column, as follows:
result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
result=result.withColumn("new_function_result",newFunction_udf(array("count","df","docs")))
The columns count, df, and docs are all integer columns, but this returns:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
When I pass a single column and compute its square instead, it works fine.
Any help is appreciated.
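The error message is misleading, but it is trying to tell you that your function does not return a float. Your function returns a value of type numpy.float64, which you can retrieve with the VectorUDT type (function: newFunctionVector in the example below). Another way to use numpy is to cast the numpy type numpy.float64 to the Python type float (function: newFunctionWithArray in the example below).
Last but not least, it is not necessary to call array, because a UDF can take more than one parameter (function: newFunction in the example below).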
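As a further variation that is not part of the answer above, the conversion step can be avoided altogether by using the standard library's math.log, which already returns a built-in Python float. The sketch below is plain Python with no Spark session; the name new_function_plain is made up for illustration, and Python 3 division is assumed so that docs / df is not truncated.

```python
import math

def new_function_plain(count, df, docs):
    # math.log returns a built-in float, so no .item() call is needed.
    # Assumes Python 3, where docs / df is true division.
    return (1 + math.log(count)) * math.log(docs / df)

# First row of the sample data: count=138, df=5, docs=10
value = new_function_plain(138, 5, 10)
print(type(value).__name__, round(value, 6))
```

Wrapped with udf(new_function_plain, FloatType()), it should be callable in the same way as newFunction in the example, with one column per argument.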
import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

# Takes a single array column and returns the raw numpy.float64
def newFunctionVector(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

# Takes a single array column and converts the numpy.float64 to a Python float
@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    return returnValue.item()

# Takes the three columns as separate parameters, so array() is not needed
@udf("float")
def newFunction(count, df, docs):
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

vector_udf = udf(newFunctionVector, VectorUDT())

result = result.withColumn("new_function_result", newFunction("count","df","docs"))
result = result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))
# Note: this column is also computed with newFunctionWithArray, so the schema reports it as float
result = result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))

result.printSchema()
result.show()
Output:
root
|-- count: long (nullable = true)
|-- df: long (nullable = true)
|-- docs: long (nullable = true)
|-- new_function_result: float (nullable = true)
|-- new_function_result_WithArray: float (nullable = true)
|-- new_function_result_Vector: float (nullable = true)
+-----+---+----+-------------------+-----------------------------+--------------------------+
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459|
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|
+-----+---+----+-------------------+-----------------------------+--------------------------+
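The numbers in the table can be cross-checked without a Spark session. The following is only a rough sketch: it recomputes the formula in plain Python and rounds each result to 32-bit precision with struct, on the assumption that Spark's float type is a 32-bit float, which would explain why only about seven significant digits are shown. The helper names to_float32 and score are made up for illustration.

```python
import math
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def score(count, df, docs):
    # Same formula as in the UDFs, written with math.log (Python 3 division)
    return (1 + math.log(count)) * math.log(docs / df)

# The sample rows from the example: (count, df, docs)
rows = [(138, 5, 10), (128, 4, 10), (112, 3, 10), (120, 3, 10), (189, 1, 10)]

rounded = [to_float32(score(*row)) for row in rows]
for row, value in zip(rows, rounded):
    print(row, value)
```

Each printed value should agree with the corresponding table entry to roughly six decimal places.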