In my MySQL DB I have a table defined like:
CREATE TABLE `mytablex_cs` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`tag` varchar(6) COLLATE utf8_bin NOT NULL DEFAULT '',
`value` text COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
KEY `kt` (`tag`),
KEY `kv` (`value`(200))
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to implement a UNIQUE constraint (key) on the value field.
I know that it is not yet possible to define a unique index on the entire value of a blob or text field, but there is a ticket(?) open to implement such a feature (see this page), where it has been suggested to create the unique key from a hash, as is already implemented for other fields.
Now I would like to use a similar approach: add another field to the table that contains the hash, and create a unique key on that field.
I looked at possible ways to create this hash and, since I would like to avoid collisions (I need to insert several million entries), it seems that RIPEMD-160 is the best algorithm, even though a quick search turned up several similar solutions that use SHA256 or even SHA1 and MD5.
I have no knowledge of cryptography at all, so what are the downsides of choosing this approach?
Another question I have is: which algorithm is currently used by MySQL to create the hash?
Let's look at your requirements:
You need to ensure that the value field is unique. The value field is a text column, and MySQL cannot (for now) create a unique index on the entire contents of such a column. So using an extra field that holds the hash of the value is your only real option here.
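As a rough illustration of the extra-field idea, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in for the MySQL table, and the `value_hash` column name, the `insert_value` helper and the choice of SHA-256 are assumptions made for the example, not part of the original schema. In MySQL the equivalent column could plausibly be a `BINARY(32)` with a UNIQUE KEY on it.

```python
import hashlib
import sqlite3


def value_hash(value: str) -> bytes:
    """Return the raw 32-byte SHA-256 digest of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).digest()


def insert_value(conn: sqlite3.Connection, tag: str, value: str) -> bool:
    """Insert a row; return False if the same value is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO mytablex_cs (tag, value, value_hash) VALUES (?, ?, ?)",
                (tag, value, value_hash(value)),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE index on value_hash rejected a duplicate digest.
        return False


# SQLite in-memory database used only as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mytablex_cs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tag TEXT NOT NULL DEFAULT '',
        value TEXT NOT NULL,
        value_hash BLOB NOT NULL UNIQUE  -- hypothetical extra column
    )
    """
)

first = insert_value(conn, "a", "some long text")
second = insert_value(conn, "b", "some long text")  # same value again
third = insert_value(conn, "c", "Some long text")   # differs only in case
print(first, second, third)
```

The second insert is rejected because its digest already exists. The third one succeeds, which matches the case-sensitive behaviour you would expect from a utf8_bin column.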
Advantages to this approach:
- Easy to calculate the hash.
- It is extremely rare for two different values to produce the same hash, so your hash values are almost guaranteed to be unique.
- Hashes are normally fixed-length numeric values (usually expressed as hexadecimal) that can be efficiently indexed.
- The hashes won't take up a lot of space. Different hashing functions return hashes of different lengths, so test the different algorithms to find one that suits your needs.
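To put a number on "extremely rare", the usual birthday-bound approximation can be evaluated for a few hash sizes. The sketch below is only an estimate: it assumes hash outputs are uniformly distributed, and the table size of 10 million rows is a guess at what "several millions of entries" might mean.

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_rows * (n_rows - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


N_ROWS = 10_000_000  # assumed table size, not taken from the question

for name, bits in [("MD5", 128), ("SHA-1 / RIPEMD-160", 160), ("SHA-256", 256)]:
    p = collision_probability(N_ROWS, bits)
    print(f"{name:<20} {bits:>3} bits  p(collision) ~ {p:.3e}")
```

Under these assumptions even a 128-bit hash gives a negligible chance of an accidental collision at this table size. The estimate says nothing about deliberately crafted collisions, which are known for MD5 and SHA-1.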
Disadvantages of this approach:
- There is an extra field to cater for during INSERTs and UPDATEs, i.e. there is more work to do.
- If you already have data in the table and it is in production, you will have to update the existing rows, and hopefully you don't have duplicates already. The update will also take time to run, so it might be tricky to apply the change to an already working system.
- Hashing functions are CPU intensive and can have a negative impact on CPU usage.
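Before adding the unique key to a table that already holds data, it may help to check for existing duplicates. The following is a minimal sketch of that check done in the application; the `find_duplicate_values` helper and the sample rows are hypothetical, and in practice the `(id, value)` pairs would be read from the table.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_values(rows: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group row ids by the SHA-256 hex digest of their value; keep groups > 1."""
    ids_by_hash: Dict[str, List[int]] = defaultdict(list)
    for row_id, value in rows:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        ids_by_hash[digest].append(row_id)
    return {h: ids for h, ids in ids_by_hash.items() if len(ids) > 1}


# Hypothetical sample of (id, value) pairs as they might be read from the table.
sample_rows = [(1, "alpha"), (2, "beta"), (3, "alpha"), (4, "Alpha")]
duplicates = find_duplicate_values(sample_rows)
for digest, ids in duplicates.items():
    print(digest[:12], ids)
```

Any group reported here would have to be resolved before the unique key can be created.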
I assume you understand what a hash function does and conceptually how it works.
You can find a list of cryptographic functions here: http://dev.mysql.com/doc/refman/5.5/en//encryption-functions.html
As far as I know, MySQL supports the MD5, SHA, SHA1 and SHA2 hashing functions. Most if not all of these should be sufficient for plain hashing. Some functions, like MD5, have known weaknesses when used in cryptographic applications, e.g. as a signature algorithm in PKI. However, these issues should not matter much when you only use the function to create a unique value, as it is not really being applied in a cryptographic context here.
To use the MySQL hashing functions you can try the following examples:
SELECT MD5('1234');
SELECT SHA('1234');
SELECT SHA1('1234');
SELECT SHA2('1234',224);
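If you would rather compute the hash in the application than in the database, Python's hashlib offers the same algorithms. This sketch assumes MySQL hashes the same bytes, i.e. that the string reaches the server in a utf8 character set; the SHA-256 entry is added for comparison and is not one of the queries above.

```python
import hashlib

text = "1234"
data = text.encode("utf-8")  # assumes the connection/column charset is utf8

digests = {
    "MD5": hashlib.md5(data).hexdigest(),
    "SHA1": hashlib.sha1(data).hexdigest(),
    "SHA2-224": hashlib.sha224(data).hexdigest(),
    "SHA2-256": hashlib.sha256(data).hexdigest(),
}

for name, hex_digest in digests.items():
    # Hex length is twice the raw byte length of the digest.
    print(f"{name:<9} {len(hex_digest):>2} hex chars  {hex_digest}")
```

The hex lengths (32, 40, 56 and 64 characters) give a feel for how much space each choice would need in the extra column. Storing the raw bytes instead of the hex string halves that.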
As with everything new, you should try all the approaches and find the one that works best in your case.