How to create UDF from Scala methods (to compute m

2019-06-24 06:40发布

I would like to build one UDF from two already working functions. I'm trying to calculate a md5 hash as a new column to an existing Spark Dataframe.

def md5(s: String): String = { toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))}
def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString("")

Structure (what i have so far)

val md5_hash: // UDF Implementation
val sqlfunc = udf(md5_hash)
val new_df = load_df.withColumn("New_MD5_Column", sqlfunc(col("Duration")))

Unfortunately i don't know how to propably implement the function as UDF.

2条回答
欢心
2楼-- · 2019-06-24 07:06

Why not using the built-in md5 function?

md5(e: Column): Column Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

You could then use it as follows:

val new_df = load_df.withColumn("New_MD5_Column", md5($"Duration"))

You have to make sure that the column is of binary type so in case it's int you may see the following error:

org.apache.spark.sql.AnalysisException: cannot resolve 'md5(Duration)' due to data type mismatch: argument 1 requires binary type, however, 'Duration' is of int type.;;

You should then change the type to be md5-compatible, i.e. binary type, using bin function.

bin(e: Column): Column An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".

A solution could be as follows then:

val solution = load_df.
  withColumn("bin_duration", bin($"duration")).
  withColumn("md5", md5($"bin_duration"))
scala> solution.show(false)
+--------+------------+--------------------------------+
|Duration|bin_duration|md5                             |
+--------+------------+--------------------------------+
|1       |1           |c4ca4238a0b923820dcc509a6f75849b|
+--------+------------+--------------------------------+

You could also "chain" functions together and do the conversion and calculating MD5 in one withColumn, but I prefer to keep steps separate in case there's an issue to resolve and having intermediate steps usually helps.

Performance

The reason why you would consider using the build-in functions bin and md5 over custom user-defined functions (UDFs) is that you could get a better performance as Spark SQL is in full control and would not add extra steps for serialization to and deserialization from an internal row representation.

enter image description here

It is not the case here, but still requires less to import and work with.

查看更多
干净又极端
3楼-- · 2019-06-24 07:09

you can use following udf function named as md5

import org.apache.spark.sql.functions._
def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString("")
def md5 = udf((s: String) => toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))))

val new_df = load_df.withColumn("New_MD5_Column", md5(col("Duration")))
查看更多
登录 后发表回答