Apply a UDF to multiple columns and use numpy operations

I have a pyspark DataFrame named result, and I want to apply a UDF to it to create a new column, as follows:

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

result=result.withColumn("new_function_result",newFunction(array("count","df","docs")))

The columns count, df, and docs are all integer columns, but this returns:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

When I pass a single column and compute its square, it works fine.

Any help is appreciated.

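The error message is misleading, but it is trying to tell you that your function does not return a float. Your function returns a value of type numpy.float64, which you can read with the VectorUDT type (function newFunctionVector in the example below). Another way to use numpy is to cast the numpy type numpy.float64 to the Python type float (function newFunctionWithArray in the example below).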
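To make the type difference concrete, here is a minimal sketch that runs without Spark. It assumes only that numpy is installed and Python 3 is used; the variable names are illustrative and not part of the original post. It evaluates the question's formula for the first row and shows that the result has type numpy.float64, while .item() or float() produces a plain Python float.

```python
import numpy as np

# Same formula as in the question, evaluated for the first row (138, 5, 10).
count, df, docs = 138, 5, 10
value = (1 + np.log(count)) * np.log(docs / df)

# numpy returns its own scalar type rather than a plain Python float.
print(type(value).__name__)      # float64

# Two ways to convert the numpy scalar to a plain Python float.
as_item = value.item()
as_float = float(value)
print(type(as_item).__name__)    # float
print(as_item == as_float)       # True
```

Either conversion could be used on the return value of a UDF declared with the "float" return type.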

Last but not least, it is not necessary to call array, because a UDF can take more than one argument (function newFunction in the example below).

import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

def newFunctionVector(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    return returnValue.item()

@udf("float")
def newFunction(count, df, docs):
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()


vector_udf = udf(newFunctionVector, VectorUDT())

result=result.withColumn("new_function_result", newFunction("count","df","docs"))

result=result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))

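# Note: as posted, this column is also computed with newFunctionWithArray;
# vector_udf defined above is not applied, so the column type is float.
result=result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))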

result.printSchema()

result.show()

Output:

root 
|-- count: long (nullable = true) 
|-- df: long (nullable = true) 
|-- docs: long (nullable = true) 
|-- new_function_result: float (nullable = true) 
|-- new_function_result_WithArray: float (nullable = true) 
|-- new_function_result_Vector: float (nullable = true)

+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459| 
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|  
+-----+---+----+-------------------+-----------------------------+--------------------------+
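As a cross-check of the values in the table, the same formula can be evaluated with the standard library alone. This is a sketch under the assumption of Python 3 (so that docs / df is true division). The helper name new_function_plain is hypothetical and does not appear in the original post. Because math.log returns a plain Python float, no .item() conversion is needed.

```python
import math

def new_function_plain(count, df, docs):
    """Hypothetical helper: same formula as newFunction, using math.log."""
    return (1 + math.log(count)) * math.log(docs / df)

# The same rows that were used to build the DataFrame.
rows = [(138, 5, 10), (128, 4, 10), (112, 3, 10), (120, 3, 10), (189, 1, 10)]

for count, df, docs in rows:
    value = new_function_plain(count, df, docs)
    # Print the inputs, the rounded result, and the result's type name.
    print(count, df, docs, round(value, 4), type(value).__name__)
```

A function written this way could presumably be wrapped with udf("float") in the same way as newFunction, though that has not been tested here.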