I'm using iTextSharp to read the text from a PDF file. However, there are times I cannot extract text, because the PDF contains only images. I download the same PDF files every day, and I want to see whether a PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell if the file has changed?
If it is, some code samples would be appreciated, because I don't have much experience with cryptography.
I know this question was already answered, but this is what I use:
Where `GetHash` is:
Probably not the best way, but it can be handy.
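As a rough sketch of how a `GetHash` helper could fit the daily-download workflow from the question: the body of `GetHash` below is an assumption (MD5 of the file contents as a hex string), and `DownloadTracker`, `HasChangedSinceLastRun`, and the sidecar hash file are hypothetical names introduced only for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class DownloadTracker
{
    // Assumed implementation: MD5 of the file contents, returned as a hex string.
    public static string GetHash(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }
    }

    // Compares the file's current hash with the one saved in hashPath on the
    // previous run, then saves the current hash for next time.
    // Returns true on the first run or when the content differs.
    public static bool HasChangedSinceLastRun(string pdfPath, string hashPath)
    {
        string current = GetHash(pdfPath);
        string previous = File.Exists(hashPath) ? File.ReadAllText(hashPath).Trim() : null;
        File.WriteAllText(hashPath, current);
        return !string.Equals(previous, current, StringComparison.OrdinalIgnoreCase);
    }
}
```

Storing only the hash means yesterday's PDF does not have to be kept around just to detect a change.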
It's very simple using System.Security.Cryptography.MD5:
(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)
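A minimal sketch of the `System.Security.Cryptography.MD5` approach described above, assuming the file is hashed by streaming it from disk. The class and method names (`Md5File`, `ComputeMd5`) are made up for this example.

```csharp
using System.IO;
using System.Security.Cryptography;

public static class Md5File
{
    // Streams the file through MD5 so the whole file is never held in memory.
    public static byte[] ComputeMd5(string path)
    {
        // Both the hash object and the stream are disposed when the block exits.
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }
}
```

The result is the raw 16-byte digest; turning it into a string for storage or display is a separate step.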
How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override `Equals`. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)

If you need to represent the hash as a string, you could convert it to hex using `BitConverter`:
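The comparison and conversion options mentioned here could look roughly like this. It is only a sketch; `HashCompare` and its method names are hypothetical, and the inputs are assumed to be digests already computed elsewhere.

```csharp
using System;
using System.Linq;

public static class HashCompare
{
    // Element-by-element comparison; needed because byte[].Equals
    // only checks reference equality.
    public static bool SameBytes(byte[] a, byte[] b) => a.SequenceEqual(b);

    // Comparison via base64 strings: simpler to get right, slightly more work at runtime.
    public static bool SameBase64(byte[] a, byte[] b) =>
        Convert.ToBase64String(a) == Convert.ToBase64String(b);

    // BitConverter.ToString produces "D4-1D-8C-..."; strip the dashes and
    // lower-case the result to get the conventional hex form.
    public static string ToHex(byte[] hash) =>
        BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}
```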
This is how I do it:
Here is a slightly simpler version that I found. It reads the entire file in one go and only requires a single `using` directive.

And if you need to calculate the MD5 to see whether it matches the MD5 of an Azure blob, then this SO question and answer might be helpful: "MD5 hash of blob uploaded on Azure doesnt match with same file on local machine"
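A sketch of the read-everything-at-once variant, with a base64 conversion added on the assumption that the blob's stored MD5 (its Content-MD5 value) is the base64 encoding of the raw digest rather than a hex string. `WholeFileMd5` and its methods are hypothetical names.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class WholeFileMd5
{
    // Loads the whole file into memory, then hashes it.
    // Fine for small files; prefer a stream for large ones.
    public static byte[] ComputeHash(string filePath)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(File.ReadAllBytes(filePath));
        }
    }

    // Base64 form of the digest, for comparing against a base64-encoded MD5 value.
    public static string ToBase64(byte[] hash) => Convert.ToBase64String(hash);
}
```

If the two values still differ, it is worth checking that the same encoding (hex versus base64) is being compared on both sides.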