Which algorithm for hashing name, firstName and bi

2019-05-20 23:33发布

问题:

I have to save the combination of lastname, firstname and birth-date of a person as a hash. This hash is later used to search for the same person with the exactly same properties. My question is, if SHA-1 is a meaningfull algorithm for this.

As far as I understand SHA-1, there is virtually no possibility that two different persons (with different attributes) will ever get the same hash-value. Is this right?

回答1:

If you want to search for a person knowing only those credentials, you could store the SHA-1 in the database(or MD5 for speed, unless you have like a quadrillion people to sample).

The hash will be worthless, as it stores no information about the person, but it can work for searching a database. You just want to make sure that the three pieces of information match, so it would be safe to just concatenate them:

user.hash = SHA1(user.firstName + user.DOB + user.lastName)

And when you query, you could check if the two match:

hash = SHA1(query.firstName + query.DOB + query.lastName)

for user in database:
  if user.hash == hash:
    return user

I put query.DOB in the middle because the first and last name might collide, like if JohnDoe Bob was born on the same day as John DoeBob. I'm not aware of numeric names, so I think this will stop collisions like those ;)

But if this is a big database, I'd try MD5. It's faster, but there is a chance of a collision (in your case, I can guarantee that one won't occur). The chance of a collision, however, is really small.

To put that into perspective, a collision is a 1 / 2^128 occurrence, which is:

                          1
---------------------------------------------------
340,282,366,920,938,463,463,374,607,431,768,211,456

And that's a little smaller than:

0.0000000000000000000000000000000000000293873 %

I'm pretty sure you're not going to get a collision ;)



回答2:

Hash collisions are inevitable. However small can be the chance of the collision, you shouldn't really rely only on hash if you really want 100% identification.

If you use hashing to speed up database search, there is no need to use SHA256. Use whatever hash function your system has with the smallest size (MD5() for MySQL or you might even try CRC32, if your database is not-so-big). Just when you query table, you need to provide all conditions you are searching by:

SELECT * from user WHERE hash="AABBCCDD" AND firstname="Pavel" AND surname="Sokolov"

Databases maintain a value, that is called index cardinality. It's a measure of uniqueness of the data on the given index. So, you can index fields you want together with hash field and database will choose the most selective index for the query himself. Adding additional conditions doesn't affect performance negatively because most database can use only one index when selecting data from the table and they will select the one with the most cardinality value.

The database will need to first select all rows matches the index and then scan through them to discard rows that doesn't match other conditions.

If you cannot use the method I described, well, I think even MD5 collision probability is very low to occur on database of people names.

P.S. I hope you know, that you know that "the combination of lastname, firstname and birth-date of a person" is not enough to 100% identify a human? And sooner this combination will match than some hashes collide.



回答3:

If you are concerned with collisions there is a good discussion here:

Understanding sha-1 collision weakness

If you have security concerns, I would consider SHA-256 instead.