I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.
The C# app that saves these records uses the following method to do the hashing:
public static string CreateMd5HashString(byte[] input)
{
var hashBytes = MD5.Create().ComputeHash(input);
return string.Join("", hashBytes.Select(b => b.ToString("X")));
}
In order to call this, I first convert the string
to byte[]
using the UTF-8 encoding:
// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72
Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES
function, but I get a different value:
print hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315
This is because SQL computes the MD5 of the UTF-16 representation of the string.
I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc"))
.
I cannot change the way hashing is done in the application.
Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?
I looked up similar questions, I tried using collations, but had no luck so far.
SQL Server does not natively support using UTF-8 strings, and it hasn't for quite a while. As you noticed, NCHAR and NVARCHAR use UCS-2 rather than UTF-8.
If you are insistent on using the
HASHBYTES
function, you must be able to pass the UTF-8byte[]
asVARBINARY
from your C# code to preserve the encoding.HASHBYTES
acceptsVARBINARY
in place ofNVARCHAR
. This could be accomplished with a CLR function that acceptsNVARCHAR
and returns the results ofEncoding.UTF8.GetBytes
asVARBINARY
.With that being said, I strongly suggest keeping these types of business rules isolated within your application rather than the database. Especially since the application is already performing this logic.
You need to create a UDF to convert the NVARCHAR data to bytes in UTF-8 Representation. Say it is called
dbo.NCharToUTF8Binary
then you can do:Here is a UDF which will do that: