I would like to build one UDF from two already working functions. I'm trying to calculate an MD5 hash as a new column on an existing Spark DataFrame.
import java.security.MessageDigest

def md5(s: String): String =
  toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

def toHex(bytes: Array[Byte]): String =
  bytes.map("%02x".format(_)).mkString("")
Structure (what I have so far):
val md5_hash: // UDF Implementation
val sqlfunc = udf(md5_hash)
val new_df = load_df.withColumn("New_MD5_Column", sqlfunc(col("Duration")))
Unfortunately, I don't know how to properly implement the function as a UDF.
Why not use the built-in md5 function?
You could then use it as follows:
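A minimal sketch of the built-in approach, reusing the load_df and Duration names from the question (assumed to exist; Duration is assumed to be a string or binary column here):

```scala
import org.apache.spark.sql.functions.{col, md5}

// The built-in md5 takes a binary (or string) column and
// returns the digest as a 32-character hex string
val new_df = load_df.withColumn("New_MD5_Column", md5(col("Duration")))
```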
You have to make sure that the column is of binary type, so if it's int you may see the following error:
You should then change the type to be md5-compatible, i.e. binary type, using the bin function. A solution could be as follows then:
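A sketch of that two-step solution, again using the question's load_df and Duration names (assumed), with Duration taken to be an int column:

```scala
import org.apache.spark.sql.functions.{bin, col, md5}

// Step 1: convert the numeric column to an md5-compatible
// (string) representation using bin
val binDF = load_df.withColumn("Duration_bin", bin(col("Duration")))

// Step 2: calculate the MD5 hash over the converted column
val new_df = binDF.withColumn("New_MD5_Column", md5(col("Duration_bin")))
```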
You could also "chain" the functions together and do the conversion and the MD5 calculation in one withColumn, e.g. md5(bin(col("Duration"))), but I prefer to keep the steps separate in case there's an issue to resolve, and having intermediate steps usually helps.

Performance
The reason why you would consider using the built-in functions bin and md5 over custom user-defined functions (UDFs) is that you could get better performance, as Spark SQL stays in full control and does not add extra steps for serialization to and deserialization from its internal row representation. That may not matter much here, but the built-ins still require less to import and work with.
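One way to see the difference is to compare the physical plans of the two variants. A sketch (the UDF below is a hypothetical placeholder, not the asker's real hash):

```scala
import org.apache.spark.sql.functions.{col, md5, udf}

// Built-in: the expression stays inside Spark SQL's optimized execution
load_df.withColumn("h", md5(col("Duration"))).explain()

// UDF: a black box to the optimizer; its input must be converted
// out of the internal row representation before each call
val hashUdf = udf((s: String) => s /* hash here */)
load_df.withColumn("h", hashUdf(col("Duration"))).explain()
```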
You can use the following udf function named md5:
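A sketch of that UDF, built directly from the question's md5 and toHex functions (load_df and the Duration column are assumed from the question):

```scala
import java.security.MessageDigest
import org.apache.spark.sql.functions.{col, udf}

def toHex(bytes: Array[Byte]): String =
  bytes.map("%02x".format(_)).mkString("")

def md5(s: String): String =
  toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

// Wrap the plain Scala function as a Spark UDF
// (note: this does not handle null inputs)
val md5_hash = udf(md5 _)

val new_df = load_df.withColumn("New_MD5_Column", md5_hash(col("Duration")))
```

If Duration is not a string column, cast it first, e.g. md5_hash(col("Duration").cast("string")).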