I'm trying to add a column to a DataFrame that will contain a hash of another column.
I found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"..." column syntax outside the shell

val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
But what is the hash function used by that hash()? Is it Murmur, SHA, MD5, or something else?
The value I get in this column is an integer, so the range of values here is presumably [-2^31 ... 2^31 - 1].
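For reference, this is how I'm checking the type (just printSchema on the result from above):

// The "hashed" column comes back as an integer type
withHashedColumn.select("hashed").printSchema()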
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?
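To make the question more concrete, the kind of thing I'm after looks roughly like this. This is only a sketch: as far as I can tell, sha2 and md5 return hex strings rather than ints, and the UDF part is just my guess at how a custom hash could be plugged in.

import org.apache.spark.sql.functions._

// Built-in digests that return hex strings (reusing df from above)
val withSha256 = df.withColumn("hashed", sha2($"my_column".cast("string"), 256))
val withMd5    = df.withColumn("hashed", md5($"my_column".cast("string")))

// A custom hash wrapped in a UDF (placeholder logic, nulls not handled)
val myHash = udf((s: String) => s.hashCode.toLong)
val withCustom = df.withColumn("hashed", myHash($"my_column".cast("string")))

Is one of these the intended approach, or is there a way to control the algorithm used by hash() itself?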