MD5 and SHA-1 hashes have known weaknesses against collision attacks. SHA-256 does not, but it outputs 256 bits. Can I safely take the first or last 128 bits and use that as the hash? I know it will be weaker (because it has fewer bits), but will it otherwise work?
Basically I want to use this to uniquely identify files in a file system that might one day contain a trillion files. I'm aware of the birthday problem: for n = 10^12 files and a 128-bit hash, the bound p ≈ n^2 / 2^129 works out to roughly 1.5 × 10^-15, i.e. well under a one-in-a-trillion chance that two different files share a hash. I can live with those odds.
What I can't live with is somebody being able to easily and deliberately insert a new file with the same hash and the same beginning of the file. I believe this is possible with MD5 and SHA-1.
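For concreteness, here's the kind of thing I mean; a minimal Python sketch, where the chunked read is just my assumption about how you'd hash large files:

```python
import hashlib

def truncated_sha256(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Return the first 128 bits (16 bytes) of a file's SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()[:16]  # or h.digest()[-16:] for the last 128 bits
```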
Yeah, that will work. Theoretically it's better to XOR the two halves together, but even truncated SHA-256 is stronger than MD5. You should still consider the result a 128-bit hash rather than a 256-bit hash, though.
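If it helps, the XOR-folding variant is one line on top of plain hashlib (a sketch of the idea, not a claim that it buys you much over simple truncation):

```python
import hashlib

def sha256_xor_folded(data: bytes) -> bytes:
    """Fold a SHA-256 digest to 128 bits by XORing its two 16-byte halves."""
    digest = hashlib.sha256(data).digest()
    return bytes(a ^ b for a, b in zip(digest[:16], digest[16:]))
```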
My recommendation in this particular case is to store and reference files by HASH + uniquifier, where the uniquifier is the count of distinct files you've previously seen with that hash. That way you don't fall flat on your face if somebody one day starts storing files built from future-discovered SHA-256 collisions.
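A minimal sketch of that scheme, assuming you can compare contents whenever a hash repeats (the class and method names here are mine, purely illustrative):

```python
from collections import defaultdict

class UniquifiedIndex:
    """Identify files by (hash, uniquifier), where the uniquifier counts
    distinct contents already seen under the same (possibly truncated) hash."""

    def __init__(self) -> None:
        self._seen = defaultdict(list)  # digest -> list of distinct contents

    def identify(self, digest: bytes, content: bytes) -> tuple[bytes, int]:
        bucket = self._seen[digest]
        for i, existing in enumerate(bucket):
            if existing == content:      # same bytes: same file, same uniquifier
                return digest, i
        bucket.append(content)           # a new, distinct file under this hash
        return digest, len(bucket) - 1
```

In practice you'd compare against the stored file (or a full 256-bit digest) rather than keep contents in memory; the point is just that a deliberate collision gets uniquifier 1 instead of silently aliasing uniquifier 0.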
Yes, that will work.
For the record, there are known in-use collision attacks against MD5, but the SHA-1 attacks are at this point completely theoretical (no SHA-1 collision has ever been found... yet).
But is it worth it? If you have a hash for each file, then you essentially have a per-file overhead. Let's say each file takes up at least 512 bytes (a typical disk sector) and that you're storing the hashes compactly enough that each one doesn't take up much more than the hash itself.
So, even if all your files are 512 bytes (the smallest case), you're talking either 16 / 512 = 3.1% or 32 / 512 = 6.3% overhead. In reality, I'd bet your average file size is higher (unless every file really is one sector...), so the overhead would be smaller.

Now, the space you need for hashes scales linearly with the number of files you have. Is that extra space worth that much? Even at your mentioned trillion files, the hashes come to 1 000 000 000 000 × 16 B ≈ 14.6 TiB truncated, or 1 000 000 000 000 × 32 B ≈ 29.1 TiB at full size, which is a lot of space, but keep in mind: your data would be 1 000 000 000 000 × 512 B ≈ 465 TiB.
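As a sanity check on those figures, a few lines of Python (the 512-byte-per-file floor is the assumption from above):

```python
FILES = 10**12   # a trillion files
SECTOR = 512     # assumed minimum file size, in bytes
TIB = 2**40

for hash_bytes in (16, 32):
    tib = FILES * hash_bytes / TIB
    pct = 100 * hash_bytes / SECTOR
    print(f"{hash_bytes}-byte hashes: {tib:6.1f} TiB ({pct:.2f}% of 512 B files)")

print(f"data floor: {FILES * SECTOR / TIB:.1f} TiB")
```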
The raw numbers are almost beside the point, really, since it's still 3% or 6% overhead either way. But at this scale, where you have half a petabyte of storage, do 15 terabytes matter? At any scale, does a 3% savings mean anything? And remember: the larger your files are, the less you save. (Which they probably are: good luck getting a 512-byte sector size on disks that big.)

So, is a 3%-or-less disk savings worth the potential security risk? (Which I'll leave unanswered, as it's waaay not my cup of tea.)

Alternatively, could you, say, group files together in some logical fashion, so that you have fewer files? (I mean, if you have trillions of 512-byte files, do you really want to hash every byte on disk?)