Previously I asked a question about combining SHA1+MD5, but after that I understood that calculating SHA1 and then MD5 of a large file is not much faster than SHA256. In my case a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C# with Mono) on a Linux system.
public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Then I read this topic and changed my code according to what it said, to:
public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
But it doesn't have much of an effect and still takes about 9 minutes.
Then I tested the same file with the sha256sum command in Linux, and it takes about 28 seconds; both the code above and the Linux command give the same result!
Someone advised me to read about the differences between a hash code and a checksum, and I reached this topic that explains the differences.
My questions are:

1. What causes such a difference in time between the above code and the Linux sha256sum?
2. What does the above code do? (I mean, is it a hash code calculation or a checksum calculation? Because if you search for how to get a hash code of a file and how to get a checksum of a file in C#, both lead to the above code.)
3. Is there any motivated attack against sha256sum even when SHA256 is collision resistant?
4. How can I make my implementation as fast as sha256sum in C#?
My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, it would seem that on a decent-spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly. Oddly, it has been reported in more than one benchmark test that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp - I have no idea whether this will work, but it bears investigation.
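For what it's worth, a minimal sketch of the TransformBlock idea might look like this (the 1 MB buffer size and the FileOptions.SequentialScan hint are guesses on my part, not measured values):

// Needs System.IO and System.Security.Cryptography.
public static string GetChecksumTransformBlock(string file)
{
    const int bufferSize = 1024 * 1024; // 1 MB read buffer - an assumption, tune to taste
    byte[] buffer = new byte[bufferSize];

    using (var sha = SHA256.Create())
    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, FileOptions.SequentialScan))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Feed each chunk to the hash, reusing the same buffer for every read.
            sha.TransformBlock(buffer, 0, bytesRead, null, 0);
        }

        // Finalise with an empty block once the whole file has been read.
        sha.TransformFinalBlock(buffer, 0, 0);
        return BitConverter.ToString(sha.Hash).Replace("-", String.Empty);
    }
}

Whether this beats the Mono ComputeHash(stream) path is something you'd have to measure; the point is only that a single reused buffer keeps allocation and garbage collection out of the picture.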
The difference between a hash code and a checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though when you have 4.6 GB of input and 64 B of output, 'fairly' is somewhat limited. A checksum is not secure: with a bit of work you can figure out the input from enough outputs, work backwards from output to input, and do all sorts of insecure things. A cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output, and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
MD5 is breakable: you can fabricate an input to produce any given output, if needed, on a PC. SHA-256 is (probably) still secure, but won't be in a few years' time - if your project has a lifespan measured in decades, then assume you'll need to change it. SHA-512 has no known attacks, and probably won't for quite a while, and since it's quicker than SHA-256 I'd recommend it anyway. Benchmarks show it takes about three times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.
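If you do switch, the change is trivial - a sketch, assuming you keep the same ComputeHash pattern as your original method:

// Same pattern as the original GetChecksum, just a different algorithm;
// the output is 512 bits, i.e. a 128-character hex string.
public static string GetChecksum512(string file)
{
    using (var sha = SHA512.Create())
    using (FileStream stream = File.OpenRead(file))
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}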
No idea, beyond those mentioned above, you're doing it right.
For a bit of light reading https://crypto.stackexchange.com/questions/26336/sha512-faster-than-sha256
Edit in response to a question in the comments
The purpose of a checksum is to allow you to check whether a file has changed between the time you originally wrote it and the time you come to use it. It does this by producing a small value, 512 bits in the case of SHA-512, where every bit of the original file contributes at least something to the output value. The purpose of a hash code is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file. The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
As an example, in the application I'm currently writing I need to know if parts of a file of any size have changed, so I split the file into 16K blocks, take the SHA-512 hash of each block and store it in a separate database on another drive. When I come to see if the file has changed I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512 the chances of a file changing but the hash staying the same are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive of which 10 seconds is probably related to hashing.
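A rough sketch of that per-block approach (the 16 KB block size is the one I mentioned; how you key and store the hashes - database, flat file, whatever - is up to you and left out here):

// Needs System.IO, System.Security.Cryptography and System.Collections.Generic.
// Hashes a file in 16 KB blocks and returns one SHA-512 hex string per block.
public static List<string> GetBlockHashes(string file)
{
    const int blockSize = 16 * 1024;
    var hashes = new List<string>();
    byte[] buffer = new byte[blockSize];

    using (var sha = SHA512.Create())
    using (var stream = File.OpenRead(file))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Each block is hashed independently, so a change in one block
            // only invalidates that block's stored hash.
            byte[] blockHash = sha.ComputeHash(buffer, 0, bytesRead);
            hashes.Add(BitConverter.ToString(blockHash).Replace("-", String.Empty));
        }
    }

    return hashes;
}

To check for changes later, recompute the list and compare it block by block against the stored one.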
Lack of disk space to store hashes is a problem I can't solve in a post ... buy a USB stick?