Creating a hashcode for use in a database (ie not

2019-02-26 02:15发布

问题:

I have recently been instructed in the ways of GetHashCode() and in particular "Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains" (From an Eric Lippert blog article).

Unfortuantely I have been using this in a database to try to speed up lookups (by inserting the result of GetHashCode rather than doing searches on text strings). I am now aware that this is a very bad thing to do.

So I'm left wondering what there is that I can do instead. Is there anything that given a string will be guaranteed to return a sensibly collision resistant integer that I can use for lookups?

I could write something myself but I was hoping that there would be something built in that I could use without having to go for stuff in the cryptographic libraries which feels a bit heavyweight.

回答1:

I would encourage you to consider what the others have said: let the database do what it's good at. Creating a hash code in order to optimize lookups is an indication that the indexes on your table aren't what they should be.

That said, if you really need a hash code:

You don't say if you want a 32-bit or 64-bit hash code. This one will create a 64-bit hash code for a string. It's reasonably collision-resistant.

public static long ComputeHashCode(string url)
{
    const ulong p = 1099511628211;

    ulong hash = 14695981039346656037;

    for (int i = 0; i < url.Length; ++i)
    {
        hash = (hash ^ url[i]) * p;
    }

    // Wang64 bit mixer
    hash = (~hash) + (hash << 21);
    hash = hash ^ (hash >> 24);
    hash = (hash + (hash << 3)) + (hash << 8);
    hash = hash ^ (hash >> 14);
    hash = (hash + (hash << 2)) + (hash << 4);
    hash = hash ^ (hash >> 28);
    hash = hash + (hash << 31);

    if (hash == (ulong)UNKNOWN_RECORD_HASH)
    {
        ++hash;
    }
    return (long)hash;
}

Note that this is a hash code and the likelihood of a collision is pretty small if you have up to a few billion records. Rule of thumb: you have a 50% chance of collision when the number of items exceeds the square root of your hash code's range. This hash code has a range of 2^64, so if you have 2^32 items, your chance of a collision is about 50%.

See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=792 and http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table for more information.



回答2:

As SLaks pointed out in a comment, looking up data is what databases are good at.

If you need fast lookups, create an index on the column. At the very least, you won't have to deal with collisions anymore.



回答3:

Are you using a MSSQL Database? The T-SQL Checksum function does exactly that.



标签: c# .net hash