If I need to retrieve a large string from a DB, is it faster to search for it using the string itself, or would I gain by hashing the string, storing the hash in the DB as well, and then searching on that?
If yes, which hash algorithm should I use? (Security is not an issue; I am looking for performance.)
If it matters: I am using C# and MSSQL2005
If you use a fixed length field and an index it will probably be faster...
I'd be surprised if this offered a huge improvement, and I would recommend not using your own performance optimisations for a DB search.
If you use a database index, there is scope for performance to be tuned by a DBA using tried and trusted methods. Hard-coding your own index optimisation will prevent this and may stop you benefiting from any indexing performance improvements in future versions of the DB.
If your strings are short (less than 100 characters in general), strings will be faster.
If the strings are large, a HASH search may, and most probably will, be faster. HashBytes(MD4) seems to be the fastest on DML.
In general: probably not, assuming the column is indexed. Database servers are designed to do such lookups quickly and efficiently. Some databases (e.g. Oracle) provide options to build indexes based on hashing.
However, in the end this can only be answered by performance testing with data and usage patterns that are representative of your requirements.
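As a rough illustration of what such a test could look like, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for SQL Server, so the numbers it prints say nothing about MSSQL 2005 itself. The table `docs`, the indexes `ix_body` and `ix_hash`, and the helper names are all made up for this example, and MD5 is just one possible hash.

```python
import hashlib
import sqlite3
import timeit


def digest(text: str) -> bytes:
    """16-byte MD5 digest used as a fixed-length lookup key (not for security)."""
    return hashlib.md5(text.encode("utf-8")).digest()


def build(n_rows: int = 5000, prefix_len: int = 300):
    """Create an in-memory table of long strings that share a common prefix."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL, body_hash BLOB NOT NULL)"
    )
    bodies = ["p" * prefix_len + str(i) for i in range(n_rows)]
    conn.executemany(
        "INSERT INTO docs (body, body_hash) VALUES (?, ?)",
        [(b, digest(b)) for b in bodies],
    )
    conn.execute("CREATE INDEX ix_body ON docs (body)")
    conn.execute("CREATE INDEX ix_hash ON docs (body_hash)")
    return conn, bodies


def by_string(conn, text):
    # Lookup through the index on the full string.
    sql = "SELECT id FROM docs INDEXED BY ix_body WHERE body = ?"
    return conn.execute(sql, (text,)).fetchone()


def by_hash(conn, text):
    # Lookup through the index on the hash; the extra string comparison
    # rejects rows that merely share the same hash value.
    sql = "SELECT id FROM docs INDEXED BY ix_hash WHERE body_hash = ? AND body = ?"
    return conn.execute(sql, (digest(text), text)).fetchone()


conn, bodies = build()
needle = bodies[len(bodies) // 2]

t_string = timeit.timeit(lambda: by_string(conn, needle), number=1000)
t_hash = timeit.timeit(lambda: by_hash(conn, needle), number=1000)
print(f"string index: {t_string:.4f}s  hash index: {t_hash:.4f}s")
```

The shape of the test data matters as much as the harness: swap in strings, row counts and query mixes that look like your real workload before drawing any conclusion.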
First - MEASURE it. That is the only way to tell for sure.
Second - If you don't have an issue with the speed of the string searching, then keep it simple and don't use a Hash.
However, to address your actual question (and just because it is an interesting thought): it depends on how similar the strings are. Remember that the DB engine doesn't need to compare all the characters in a string, only enough to find a difference. If you are looking through 10 million strings that all start with the same 300 characters, then the hash will almost certainly be faster. If, however, you are looking for the only string that starts with an x, then the string comparison could be faster. I think, though, that SQL will still have to get the entire string from disk, even if it then only uses the first byte (or first few bytes for multi-byte characters), so the total string length will still have an impact.
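To make the "only enough to find a difference" point concrete, here is a small Python sketch. The function `chars_compared` is hypothetical and only models a plain left-to-right comparison; a real engine's collation-aware comparison is more involved.

```python
def chars_compared(a: str, b: str) -> int:
    """Count how many character positions are examined before the
    comparison hits a mismatch or runs out of characters."""
    examined = 0
    for x, y in zip(a, b):
        examined += 1
        if x != y:
            break
    return examined


shared = "a" * 300

# Strings that share a 300-character prefix: the difference is found late.
print(chars_compared(shared + "x", shared + "y"))  # 301

# Strings that differ in the first character: the difference is found at once.
print(chars_compared("x" + shared, "y" + shared))  # 1
```

A fixed-length hash always costs the same to compare, which is why it pays off mainly in the first case.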
If you are trying the hash comparison then you should make the hash an indexed calculated column. It will not be faster if you are working out the hashes for all the strings each time you run a query!
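The idea of storing the hash once and indexing it can be sketched like this in Python. SQLite stands in for SQL Server here, and the hash is filled in at insert time instead of through a real computed column; `docs`, `body_hash`, `ix_docs_body_hash` and the helper functions are names invented for the example.

```python
import hashlib
import sqlite3


def digest(text: str) -> bytes:
    """Fixed-length 16-byte key derived from an arbitrarily long string."""
    return hashlib.md5(text.encode("utf-8")).digest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs ("
    "id INTEGER PRIMARY KEY, body TEXT NOT NULL, body_hash BLOB NOT NULL)"
)
# The hash column is indexed, so a lookup does not have to scan every row.
conn.execute("CREATE INDEX ix_docs_body_hash ON docs (body_hash)")

# The hash is computed once per row, when the row is written.
bodies = ["x" * 300 + str(i) for i in range(1000)]
conn.executemany(
    "INSERT INTO docs (body, body_hash) VALUES (?, ?)",
    [(b, digest(b)) for b in bodies],
)


def find_by_hash(conn, text):
    # Only the search value is hashed at query time; the stored hashes are
    # reused. The string comparison guards against hash collisions.
    sql = "SELECT id FROM docs WHERE body_hash = ? AND body = ?"
    return [row[0] for row in conn.execute(sql, (digest(text), text))]


print(find_by_hash(conn, "x" * 300 + "42"))  # [43]
print(find_by_hash(conn, "not in the table"))  # []
```

The slow variant the paragraph warns about would be a query that hashes the `body` column of every row on each search, which throws away any benefit of the index.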
You could also consider using SQL's CRC function. It produces an int, which will be even quicker to compare and is faster to calculate. But you will have to double-check the results of this query by actually testing the string values, because the CRC function is not designed for this sort of usage and is much more likely to return duplicate values. You will need to do the CRC or hash check in one query, then have an outer query that compares the strings. You will also want to watch the query execution plan (QEP) generated to make sure the optimiser is processing the query in the order you intended. It might decide to do the string comparisons first and the CRC or hash checks second.
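Here is a minimal Python sketch of that two-step check, again with SQLite standing in for SQL Server and `zlib.crc32` standing in for the database's own checksum function. All table, index and function names are hypothetical, and the collision is simulated by storing a second row with the same checksum value on purpose.

```python
import sqlite3
import zlib


def crc(text: str) -> int:
    """32-bit checksum: cheap to compute and compare, but collisions happen."""
    return zlib.crc32(text.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs ("
    "id INTEGER PRIMARY KEY, body TEXT NOT NULL, body_crc INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX ix_docs_body_crc ON docs (body_crc)")

target = "a long string " * 50
conn.execute("INSERT INTO docs (body, body_crc) VALUES (?, ?)", (target, crc(target)))
# Simulated collision: a different string stored under the same checksum.
conn.execute(
    "INSERT INTO docs (body, body_crc) VALUES (?, ?)",
    ("a different string", crc(target)),
)

EXACT_SQL = """
    SELECT c.id
    FROM (SELECT id, body FROM docs WHERE body_crc = ?) AS c  -- checksum filter
    WHERE c.body = ?                                          -- string double check
"""


def candidates(conn, text):
    # Checksum match only: may return rows whose strings differ.
    sql = "SELECT id FROM docs WHERE body_crc = ?"
    return sorted(row[0] for row in conn.execute(sql, (crc(text),)))


def exact(conn, text):
    # Checksum filter inside, string comparison outside.
    return [row[0] for row in conn.execute(EXACT_SQL, (crc(text), text))]


def plan_details(conn, text):
    # The last column of EXPLAIN QUERY PLAN describes each step of the plan.
    rows = conn.execute("EXPLAIN QUERY PLAN " + EXACT_SQL, (crc(text), text))
    return [row[-1] for row in rows]


print(candidates(conn, target))  # [1, 2]
print(exact(conn, target))       # [1]
print(plan_details(conn, target))
```

Printing the plan is the SQLite equivalent of checking the QEP: you want to see the checksum index being used to narrow the rows before any long strings are compared.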
As someone else has pointed out, this is only any good if you are doing an exact match. A hash can't help if you are trying to do any sort of range or partial match.
Are you doing an equality match, or a containment match? For an equality match, you should let the db handle this (but add a non-clustered index) and just test via WHERE table.Foo = @foo. For a containment match, you should perhaps look at a full-text index.