Calculate hash without using exisiting hash fuctio

2019-01-29 04:00发布

I want to calculate hash for strings in hive without writing any UDF only using exisiting functions . So that I can use similar approach to get consistent hash in other languages. for ex : are there any functions using which I can do something like adding characters or taking Xor.

回答1:

It depends on the version of Hive, cf. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions

select XYZ, hash(XYZ) from ABC
has been available for years and applies plain old java.lang.String.hashCode(), returning an INT (32 bit hash)

[Edit 2] Actually it's a bit more complex since hash() accepts a list of arguments of any type (incl. primitive types that have no built-in hashing method), so a custom approach is used -- check ObjectInspectorUtils.hashCode() and ObjectInspectorUtils.getBucketHashCode() in the source code here (for V2.1)

select XYZ, crc32(XYZ) from ABC
requires Hive 1.3 and applies plain old Cyclic Redundancy Check (probably via java.util.zip.CRC32), returning a BIGINT (32 bit hash)

select XYZ, md5(XYZ), sha1(XYZ), sha2(XYZ,256), sha2(XYZ,512) from ABC
requires Hive 1.3 and applies strong, cryptographic hash functions, returning a STRING with the hexadecimal representation of the binary (128, 160, 256 and 512 bit hashes)

[Edit 1] the answer to that post has also a very good workaround for applying crypto hash functions with older versions of Hive, using Apache Commons static methods and reflect().