I've generated an md5 hash as below:
DECLARE @varchar varchar(400)
SET @varchar = 'è'
SELECT CONVERT(VARCHAR(2000), HASHBYTES( 'MD5', @varchar ), 2)
Which outputs:
785D512BE4316D578E6650613B45E934
However generating an MD5 hash using:
System.Text.Encoding.UTF8.GetBytes("è")
generates:
0a35e149dbbb2d10d744bf675c7744b1
The encoding in the C# .NET method is set to UTF8, and I had assumed that varchar was also UTF-8. Any ideas on what I'm doing wrong?
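The mismatch can be reproduced without either system. A Python sketch, assuming the server uses a default Code Page 1252 collation such as `SQL_Latin1_General_CP1_CI_AS` (that assumption is what makes the first digest match):

```python
import hashlib

s = "è"

# SQL Server's VARCHAR (under a default Latin1 collation) stores "è" as the
# single Code Page 1252 byte 0xE8; HASHBYTES('MD5', @varchar) hashes that byte.
sql_side = hashlib.md5(s.encode("cp1252")).hexdigest()

# .NET's Encoding.UTF8.GetBytes("è") yields the two bytes 0xC3 0xA8 instead.
dotnet_side = hashlib.md5(s.encode("utf-8")).hexdigest()

print(sql_side)     # 785d512be4316d578e6650613b45e934
print(dotnet_side)  # 0a35e149dbbb2d10d744bf675c7744b1
```

The two sides hash different byte sequences for the same character, so the digests cannot agree.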
SQL Server uses UCS-2 rather than UTF-8 to encode character data.
If you were using an NVarChar field, the following would work:
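The reason it works can be sketched in Python, with `utf-16-le` standing in for both an NVARCHAR column's UTF-16 LE storage and .NET's `Encoding.Unicode`:

```python
import hashlib

# NVARCHAR stores "è" as UTF-16 Little Endian: the two bytes 0xE8 0x00.
# .NET's Encoding.Unicode produces exactly the same bytes, so once both
# sides hash UTF-16 LE, the MD5 digests agree.
utf16_bytes = "è".encode("utf-16-le")
assert utf16_bytes == b"\xe8\x00"

print(hashlib.md5(utf16_bytes).hexdigest())
```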
For more information on SQL and C# hashing, see
http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx
If you are dealing with `NVARCHAR`/`NCHAR` data (which is stored as UTF-16 Little Endian), then you would use the `Unicode` encoding, not `BigEndianUnicode`. In .NET, UTF-16 is called `Unicode` while other Unicode encodings are referred to by their actual names: UTF7, UTF8, and UTF32. Hence, `Unicode` by itself is Little Endian as opposed to `BigEndianUnicode`.

UPDATE: Please see the section at the end regarding UCS-2 and Supplementary Characters.

On the database side:
On the .NET side:
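A Python sketch of the byte-order point, where `utf-16-le` stands in for .NET's `Unicode` encoding and `utf-16-be` for `BigEndianUnicode`:

```python
import hashlib

s = "è"
little = s.encode("utf-16-le")   # b"\xe8\x00" -- what NVARCHAR stores
big    = s.encode("utf-16-be")   # b"\x00\xe8" -- BigEndianUnicode's byte order

# Same character, different byte order, therefore different digests:
print(hashlib.md5(little).hexdigest() != hashlib.md5(big).hexdigest())  # True
```

This is why `BigEndianUnicode` will never match `HASHBYTES` over `NVARCHAR` data even though both are "UTF-16".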
However, this question pertains to `VARCHAR`/`CHAR` data, which is ASCII, and so things are a bit more complicated.

On the database side:
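A Python model of what each encoding hands to MD5 for "è"; `errors="replace"` here is an assumption standing in for .NET `Encoding.ASCII`'s substitution of `?` for non-ASCII characters:

```python
import hashlib

s = "è"
encodings = {
    "ASCII (replace)": s.encode("ascii", errors="replace"),  # b"?"
    "UTF8":            s.encode("utf-8"),                    # b"\xc3\xa8"
    "UTF-16 LE":       s.encode("utf-16-le"),                # b"\xe8\x00"
}
digests = {name: hashlib.md5(data).hexdigest() for name, data in encodings.items()}

for name, digest in digests.items():
    print(f"{name:16} {digest}")
```

Three different byte sequences, three different digests.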
We already see the .NET side above. From those hashed values there should be two questions:

1. Why doesn't any of them match the `HASHBYTES` value?
2. Why did three of them (`ASCII`, `UTF7`, and `UTF8`) all match the `HASHBYTES` value in the "sqlteam" test?

There is one answer that covers both questions: Code Pages. The test done in the "sqlteam" article used "safe" ASCII characters that are in the 0 - 127 range (in terms of the int / decimal value) that do not vary between Code Pages. But the 128 - 255 range -- where we find the "è" character -- is the Extended set that does vary by Code Page (which makes sense as this is the reason for having Code Pages).
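This is easy to check. A Python sketch comparing two single-byte code pages (cp1252, Latin1, vs cp1255, Hebrew, which as shown below lacks `è`); `errors="replace"` models the `?` substitution for unrepresentable characters:

```python
import hashlib

def md5_cp(text, codepage):
    # errors="replace" mimics substituting '?' for characters
    # the code page cannot represent.
    return hashlib.md5(text.encode(codepage, errors="replace")).hexdigest()

# 0 - 127 is identical across these code pages, so the hashes agree:
print(md5_cp("abc123", "cp1252") == md5_cp("abc123", "cp1255"))  # True

# "è" (232) is in the Extended range and differs between them:
print(md5_cp("è", "cp1252") == md5_cp("è", "cp1255"))  # False
```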
Now try:
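A Python stand-in for that `COLLATE` experiment (forcing "è" through Code Page 1255), where `errors="replace"` again models the server substituting `?` for characters the code page lacks:

```python
import hashlib

# Forcing "è" through Code Page 1255 (which lacks the character) turns it
# into "?" -- exactly the byte .NET's ASCII encoding produces for it.
collated = hashlib.md5("è".encode("cp1255", errors="replace")).hexdigest()
ascii_side = hashlib.md5("è".encode("ascii", errors="replace")).hexdigest()

print(collated == ascii_side)  # True
```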
That matches the `ASCII` hashed value (and again, because the "sqlteam" article / test used values in the 0 - 127 range, they did not see any changes when using `COLLATE`). Great, now we finally found a way to match `VARCHAR`/`CHAR` data. All good?

Well, not really. Let's take a look-see at what we were actually hashing:
Returns:

A `?`? Just to verify, run:

Ah, so Code Page 1255 doesn't have the `è` character, so it gets translated as everyone's favorite `?`. But then why did that match the MD5 hashed value in .NET when using the ASCII encoding? Could it be that we weren't actually matching the hashed value of `è`, but instead were matching the hashed value of `?`:
Yup. The true ASCII character set is just the first 128 characters (values 0 - 127). And as we just saw, the `è` is 232. So, using the `ASCII` encoding in .NET is not that helpful. Nor was using `COLLATE` on the T-SQL side.

Is it possible to get a better encoding on the .NET side? Yes, by using Encoding.GetEncoding(Int32), which allows for specifying the Code Page. The Code Page to use can be discovered using the following query (use `sys.columns` when working with a column instead of a literal or variable):

The query above returns (for me):
So, let's try Code Page 1252:
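In Python, the `cp1252` codec produces the same bytes as .NET's `Encoding.GetEncoding(1252)`, so the match can be sketched as:

```python
import hashlib

# GetEncoding(1252) in .NET and "cp1252" in Python both map "è" to the
# single byte 0xE8 -- the byte a VARCHAR stores under the default
# SQL Server collation -- so the digest finally lines up with HASHBYTES.
digest = hashlib.md5("è".encode("cp1252")).hexdigest()
print(digest)  # 785d512be4316d578e6650613b45e934
```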
Woo hoo! We have a match for `VARCHAR` data that uses our default SQL Server collation :). Of course, if the data is coming from a database or field set to a different collation, then `GetEncoding(1252)` might not work and you will have to find the actual matching Code Page using the query shown above (a Code Page is used across many Collations, so a different Collation does not necessarily imply a different Code Page).

To see what the possible Code Page values are, and what culture / locale they pertain to, please see the list of Code Pages here (list is in the "Remarks" section).
Additional info related to what is actually stored in `NVARCHAR`/`NCHAR` fields: Any UTF-16 character (2 or 4 bytes) can be stored, though the default behavior of the built-in functions assumes that all characters are UCS-2 (2 bytes each), which is a subset of UTF-16. Starting in SQL Server 2012, it is possible to access a set of Windows collations that support the 4 byte characters known as Supplementary Characters. Using one of these Windows collations ending in `_SC`, either specified for a column or directly in a query, will allow the built-in functions to properly handle the 4 byte characters.

SQL Server's HASHBYTES always behaves like System.Text.Encoding.Unicode on Unicode characters (Arabic, Persian, etc.) when given NVARCHAR data. If you use Encoding.UTF8 or Encoding.ASCII you will see a difference, but if you use Encoding.Unicode the results from SQL Server and C# will be the same.
I was having the same issue, and as @srutzky comments, what might be happening is that I didn't precede the query with a capital N, so I was getting 8-bit Extended ASCII (VARCHAR / string not prefixed with a capital N) instead of 16-bit UTF-16 Little Endian (NVARCHAR / string prefixed with a capital N).
If you do:
It will output: E99A18C428CB38D5F260853678922E03
But if you do this, having the same password ('abc123'):
It will output: 6E9B3A7620AAF77F362775150977EEB8
What I should have done is:
That outputs the same result: 6E9B3A7620AAF77F362775150977EEB8
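Both results can be checked in Python, with `ascii` standing in for the plain VARCHAR literal and `utf-16-le` for the `N'...'` NVARCHAR literal:

```python
import hashlib

pwd = "abc123"

# Plain 'abc123' literal (VARCHAR): one byte per character.
varchar_md5 = hashlib.md5(pwd.encode("ascii")).hexdigest()
print(varchar_md5)   # e99a18c428cb38d5f260853678922e03

# N'abc123' literal (NVARCHAR): UTF-16 LE, two bytes per character.
nvarchar_md5 = hashlib.md5(pwd.encode("utf-16-le")).hexdigest()
print(nvarchar_md5)  # 6e9b3a7620aaf77f362775150977eeb8
```

Same password, but the N-prefixed literal doubles the byte count, so the digests differ.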