I would like to build one UDF from two already working functions. I'm trying to calculate an MD5 hash as a new column on an existing Spark DataFrame.
import java.security.MessageDigest

def md5(s: String): String =
  toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

def toHex(bytes: Array[Byte]): String =
  bytes.map("%02x".format(_)).mkString("")
Structure (what I have so far):
val md5_hash: // UDF Implementation
val sqlfunc = udf(md5_hash)
val new_df = load_df.withColumn("New_MD5_Column", sqlfunc(col("Duration")))
Unfortunately, I don't know how to properly implement the function as a UDF.
Why not use the built-in md5 function?
You could then use it as follows:
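A minimal sketch of the built-in approach, reusing the load_df and Duration names from the question (assumed to exist; Duration is assumed to be a string or binary column here):

```scala
import org.apache.spark.sql.functions.{col, md5}

// The built-in md5 takes a binary (or string) column and
// returns the digest as a 32-character hex string
val new_df = load_df.withColumn("New_MD5_Column", md5(col("Duration")))
```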
You have to make sure that the column is of binary type, so if it's int you may see the following error:
You should then change the type to be md5-compatible, i.e. binary type, using the bin function. A solution could be as follows then:
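A sketch of that two-step solution, again using the question's load_df and Duration names (assumed), with Duration taken to be an int column:

```scala
import org.apache.spark.sql.functions.{bin, col, md5}

// Step 1: convert the numeric column to an md5-compatible
// (string) representation using bin
val binDF = load_df.withColumn("Duration_bin", bin(col("Duration")))

// Step 2: calculate the MD5 hash over the converted column
val new_df = binDF.withColumn("New_MD5_Column", md5(col("Duration_bin")))
```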
You could also "chain" the functions together and do the conversion and the MD5 calculation in one withColumn, e.g. md5(bin(col("Duration"))), but I prefer to keep the steps separate in case there's an issue to resolve, and having intermediate steps usually helps.

Performance
The reason why you would consider using the built-in functions bin and md5 over custom user-defined functions (UDFs) is that you could get better performance, as Spark SQL stays in full control and does not add extra steps for serialization to and deserialization from its internal row representation. That may not matter much here, but the built-ins still require less to import and work with.
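One way to see the difference is to compare the physical plans of the two variants. A sketch (the UDF below is a hypothetical placeholder, not the asker's real hash):

```scala
import org.apache.spark.sql.functions.{col, md5, udf}

// Built-in: the expression stays inside Spark SQL's optimized execution
load_df.withColumn("h", md5(col("Duration"))).explain()

// UDF: a black box to the optimizer; its input must be converted
// out of the internal row representation before each call
val hashUdf = udf((s: String) => s /* hash here */)
load_df.withColumn("h", hashUdf(col("Duration"))).explain()
```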
You can use the following udf function named md5:
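A sketch of that UDF, built directly from the question's md5 and toHex functions (load_df and the Duration column are assumed from the question):

```scala
import java.security.MessageDigest
import org.apache.spark.sql.functions.{col, udf}

def toHex(bytes: Array[Byte]): String =
  bytes.map("%02x".format(_)).mkString("")

def md5(s: String): String =
  toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

// Wrap the plain Scala function as a Spark UDF
// (note: this does not handle null inputs)
val md5_hash = udf(md5 _)

val new_df = load_df.withColumn("New_MD5_Column", md5_hash(col("Duration")))
```

If Duration is not a string column, cast it first, e.g. md5_hash(col("Duration").cast("string")).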