How to calculate the MD5 hash of a large file in C

Posted 2019-01-30 20:57

Question:

I am writing in C using the OpenSSL library.

How can I calculate the MD5 hash of a large file?

As far as I know, I need to load the whole file into RAM as a char array and then call the hash function. But what if the file is about 4 GB? That sounds like a bad idea.

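SOLVED: Thanks to askovpen, I found my bug. I had written

while ((bytes = fread (data, 1, 1024, inFile)) != 0)
    MD5_Update (&mdContext, data, 1024);

instead of

while ((bytes = fread (data, 1, 1024, inFile)) != 0)
    MD5_Update (&mdContext, data, bytes);

The last fread usually returns fewer than 1024 bytes, so the first version also hashed stale bytes left in the buffer.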

Answer 1:

Example. Compile with:

gcc -g -Wall -o file file.c -lssl -lcrypto

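#include <stdio.h>
#include <openssl/md5.h>

/* Note: the MD5_* functions are deprecated as of OpenSSL 3.0
   in favor of the EVP interface, but they still work. */
int main(void)
{
    unsigned char c[MD5_DIGEST_LENGTH];  /* final 16-byte digest */
    const char *filename = "file.c";
    int i;
    FILE *inFile = fopen(filename, "rb");
    MD5_CTX mdContext;
    size_t bytes;                        /* fread returns size_t */
    unsigned char data[1024];            /* read buffer; any size works */

    if (inFile == NULL) {
        fprintf(stderr, "%s can't be opened.\n", filename);
        return 1;
    }

    MD5_Init(&mdContext);
    /* Feed the file to MD5 one chunk at a time. Pass the number of
       bytes actually read, since the last chunk is usually short. */
    while ((bytes = fread(data, 1, sizeof data, inFile)) != 0)
        MD5_Update(&mdContext, data, bytes);
    MD5_Final(c, &mdContext);
    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", c[i]);
    printf(" %s\n", filename);
    fclose(inFile);
    return 0;
}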

Result (the program's digest matches md5sum):

$ md5sum file.c
25a904b0e512ee546b3f47574703d9fc  file.c
$ ./file
25a904b0e512ee546b3f47574703d9fc file.c


Answer 2:

First, MD5 is a hashing algorithm. It doesn't encrypt anything.

Anyway, you can read the file in chunks of whatever size you like. Call MD5_Init once, then call MD5_Update with each chunk of data you read from the file. When you're done, call MD5_Final to get the result.



Answer 3:

You don't have to load the entire file into memory at once. You can use the functions MD5_Init(), MD5_Update() and MD5_Final() to process it in chunks and produce the hash. If you are worried about making it an "atomic" operation, you may need to lock the file to prevent someone else from changing it while it is being hashed.
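
As one way to keep the file from changing mid-hash, cooperating processes on Linux or the BSDs can take an advisory lock with flock(). This is a sketch under that assumption: the helper names are hypothetical, flock() is not part of standard C or OpenSSL, and an advisory lock only constrains programs that also call flock() on the file.

```c
#define _DEFAULT_SOURCE  /* expose fileno() under strict ISO C modes with glibc */
#include <stdio.h>
#include <sys/file.h>    /* flock(), LOCK_SH, LOCK_UN */

/* Hypothetical helper: take a shared (read) lock on an open stream.
   Blocks while another process holds an exclusive lock on the file.
   Returns 1 on success, 0 on failure. */
int lock_for_hashing(FILE *fp)
{
    return flock(fileno(fp), LOCK_SH) == 0;
}

/* Hypothetical helper: release the lock once the digest is finalized.
   Returns 1 on success, 0 on failure. */
int unlock_after_hashing(FILE *fp)
{
    return flock(fileno(fp), LOCK_UN) == 0;
}
```

The intended order is: fopen, lock_for_hashing, the Init/Update/Final loop, unlock_after_hashing, fclose. Writers would need to take LOCK_EX on the same file for the lock to have any effect.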



Answer 4:

The top answer is correct, but one point is worth spelling out: the value of the hash does not depend on the buffer size. Each update call simply appends bytes to the running digest, so reading 1 byte at a time or 64 KB at a time produces the same result, as long as every call is given the number of bytes actually read. If changing the buffer size changes your hash, the read loop has a bug, typically a wrong length passed to the update call.

This also makes it easy to check that your digest code works: compare its output with an independent tool such as md5sum or an online hashing website. It is perfectly acceptable to use a buffer length of 1 to hash a large file; it will just take longer.

So my rule of thumb is to choose the buffer length for speed (a few kilobytes or more), since the result is compatible with other systems either way.

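#include <stdio.h>
#include <stdlib.h>
#include <openssl/evp.h>

#define FILE_BUFFER_LENGTH 4096  // any size gives the same digest

// Hashes the rest of fp with SHA-512 (swap in EVP_md5() for MD5).
// On success returns 1, stores a malloc'd digest in *md_value (the
// caller frees it) and its length in *md_len. Returns 0 on error.
int hashTargetFile(FILE* fp, unsigned char** md_value, int *md_len) {

    EVP_MD_CTX *mdctx = NULL;
    const EVP_MD *md = EVP_sha512();
    unsigned int diglen = 0; //digest length
    unsigned char *digest_value = malloc(EVP_MAX_MD_SIZE);
    unsigned char data[FILE_BUFFER_LENGTH];
    size_t bytes; //# of bytes read from file
    int ok = 0;

    if (!digest_value) {
        fprintf(stderr, "Error while allocating digest buffer.\n");
        return 0;
    }

    mdctx = EVP_MD_CTX_new();
    if (!mdctx) {
        fprintf(stderr, "Error while creating digest context.\n");
        goto done;
    }

    if (!EVP_DigestInit_ex(mdctx, md, NULL)) {
        fprintf(stderr, "Error while initializing digest context.\n");
        goto done;
    }

    // The inner parentheses matter: without them, bytes receives the
    // result of the != comparison (0 or 1) instead of the byte count.
    while ((bytes = fread(data, 1, sizeof data, fp)) != 0) {
        if (!EVP_DigestUpdate(mdctx, data, bytes)) {
            fprintf(stderr, "Error while digesting file.\n");
            goto done;
        }
    }

    if (!EVP_DigestFinal_ex(mdctx, digest_value, &diglen)) {
        fprintf(stderr, "Error while finalizing digest.\n");
        goto done;
    }

    *md_value = digest_value;
    *md_len = (int)diglen;
    ok = 1;

done:
    if (!ok)
        free(digest_value);   // do not leak the buffer on error
    EVP_MD_CTX_free(mdctx);   // safe to call with NULL
    return ok;
}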


Tags: c hash md5