I have an API route that proxies a file upload from the browser/client to AWS S3.
This API route attempts to stream the file as it is uploaded to avoid buffering the entire contents of the file in memory on the server.
However, the route also attempts to calculate an MD5 checksum of the file's body. As each part of the file is chunked, the hash.update() method is invoked with the chunk.
http://nodejs.org/api/crypto.html#crypto_hash_update_data_input_encoding
var crypto = require('crypto');
var hash = crypto.createHash('md5');

function write(chunk) {
  // invoked many times as file is uploaded
  hash.update(chunk);
}

function done() {
  // will hash buffer all chunks in memory at this point?
  hash.digest('hex');
}
Will the instance of Hash buffer all the contents of the file in order to perform the hash calculation (thus defeating the goal of avoiding buffering the entire file's contents in memory)? Or can an MD5 hash be calculated incrementally, without ever having the entire input available to perform the calculation?
MD5 and some other hash functions are based on the Merkle–Damgård construction, which supports incremental (streaming) hashing of data. The data is folded, block by block, into an internal state of fixed size. A finalization step then pads and processes the last block and simply returns the final state as the hash.
This is probably also why many hashing libraries are designed with separate update and finalization steps.
To answer your question: No, the file content is not kept in a buffer, but is rather transformed into a fixed size internal state.
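A quick way to convince yourself is to hash the same bytes once in a single call and once in small pieces. This is only a minimal sketch using Node's built-in crypto module; the sample input and the 4-byte chunk size are arbitrary choices for illustration.

```javascript
const crypto = require('crypto');

// Hash the same bytes two ways: in one call, and in small chunks.
const data = Buffer.from('hello world');

const oneShot = crypto.createHash('md5').update(data).digest('hex');

const incremental = crypto.createHash('md5');
for (let offset = 0; offset < data.length; offset += 4) {
  // Each update() call is handed only a 4-byte slice of the input.
  incremental.update(data.subarray(offset, offset + 4));
}
const chunked = incremental.digest('hex');

console.log(oneShot === chunked); // true
```

Both digests match, so the result does not depend on how the input was split into chunks.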
Practically all modern cryptographic hash functions are designed so that they can be updated incrementally.
To allow for incremental updates, the input data of the message is first arranged in blocks, which are processed in order. To do this, the implementation usually buffers the input internally until it has a full block, and then processes this block together with the current state to produce a new state, using a so-called compression function. The initial state usually consists of predetermined constant values. During the call to digest, the last block is padded, usually with bit padding and an encoding of the message length, and the final state is calculated; this may require an additional block without any message data. A final operation may then be performed before the resulting hash value is returned.
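The buffering pattern described above can be sketched roughly as follows. Everything here is hypothetical and for illustration only: ToyBlockHash, its compress method, the initial constant and the simplified padding are made up, and the "compression function" is a trivial mixing loop that is neither MD5 nor secure. The point is only that update() keeps less than one block of leftover input plus a fixed-size state.

```javascript
// Toy illustration only: NOT MD5 and NOT cryptographically secure.
const BLOCK_SIZE = 64; // bytes, i.e. 512 bits as in MD5

class ToyBlockHash {
  constructor() {
    this.state = 0x12345678;       // fixed-size state, predetermined constant
    this.buffer = Buffer.alloc(0); // leftover input, always < BLOCK_SIZE bytes
    this.totalBytes = 0;           // message length, needed for the padding
  }

  // Hypothetical stand-in for a real compression function:
  // mixes one block into the current state to produce the new state.
  compress(block) {
    let s = this.state;
    for (const byte of block) {
      s = (Math.imul(s, 31) + byte) >>> 0;
    }
    this.state = s;
  }

  update(chunk) {
    this.totalBytes += chunk.length;
    let data = Buffer.concat([this.buffer, chunk]);
    // Process every complete block; only the remainder is kept.
    while (data.length >= BLOCK_SIZE) {
      this.compress(data.subarray(0, BLOCK_SIZE));
      data = data.subarray(BLOCK_SIZE);
    }
    this.buffer = Buffer.from(data); // copy so the chunk can be released
    return this;
  }

  digest() {
    // Simplified padding: 0x80 marker, zeros, then the message length.
    const paddedLength =
      Math.ceil((this.buffer.length + 9) / BLOCK_SIZE) * BLOCK_SIZE;
    const padded = Buffer.alloc(paddedLength);
    this.buffer.copy(padded);
    padded[this.buffer.length] = 0x80;
    padded.writeUInt32LE(this.totalBytes >>> 0, padded.length - 8);
    for (let i = 0; i < padded.length; i += BLOCK_SIZE) {
      this.compress(padded.subarray(i, i + BLOCK_SIZE));
    }
    return this.state.toString(16).padStart(8, '0');
  }
}

// Usage: hashing in 7-byte pieces gives the same value as hashing at once.
const input = Buffer.alloc(200, 'a');
const whole = new ToyBlockHash().update(input).digest();
const pieces = new ToyBlockHash();
for (let i = 0; i < input.length; i += 7) {
  pieces.update(input.subarray(i, i + 7));
}
console.log(whole === pieces.digest()); // true
```

A real implementation would use a proper compression function and a wider state, but the shape of update() and digest() is the same idea.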
For MD5 the Merkle–Damgård construction is used. This common construction is also used for SHA-1 and SHA-2. SHA-2 is a family of hashes based on the algorithms for SHA-256 (from which SHA-224 is derived) and SHA-512 (from which SHA-384, SHA-512/224 and SHA-512/256 are derived). MD5 in particular uses a block size of 512 bits and an internal state of 128 bits. For MD5, SHA-1, SHA-256 and SHA-512, the internal state after the last block (including padding) is simply output directly without any post-processing.
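To make the block and padding arithmetic concrete for MD5: the padding consists of one marker byte, zero bytes, and an 8-byte length field, so at least 9 bytes are always appended before rounding up to a multiple of 64 bytes. The helper name md5PaddedLength below is made up for this sketch; it only computes sizes and does not hash anything.

```javascript
const crypto = require('crypto');

// Total number of bytes MD5 processes for a message of the given length:
// message + 0x80 marker byte + zero padding + 8-byte length field,
// rounded up to a whole number of 64-byte (512-bit) blocks.
function md5PaddedLength(messageBytes) {
  const BLOCK = 64;
  const MIN_PADDING = 1 + 8;
  return Math.ceil((messageBytes + MIN_PADDING) / BLOCK) * BLOCK;
}

console.log(md5PaddedLength(0));  // 64
console.log(md5PaddedLength(55)); // 64
console.log(md5PaddedLength(56)); // 128
console.log(md5PaddedLength(64)); // 128

// The output is the 128-bit state, i.e. 16 bytes.
console.log(crypto.createHash('md5').update('abc').digest().length); // 16
```

The 64-byte case is the "additional block without any message data": the message fills one block exactly, so the padding has to go into a second block of its own.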
Keccak has been chosen to be SHA-3. It is a sponge construction, built around a fixed permutation rather than a Merkle–Damgård compression function. Not being a Merkle–Damgård hash is a big reason why it has been chosen as SHA-3. It still has all the update properties of Merkle–Damgård hashes, and it offers the same output sizes as SHA-2 so that it can be used as a replacement. It splits up and buffers blocks just like the previously mentioned hashes, but it has a larger internal state and performs final operations on the output, making it arguably more secure.
So when you use a block-based hash such as MD5, you are unknowingly performing some additional buffering. Fortunately, buffering a single 512-bit block plus the 128-bit state (80 bytes in total) is unlikely to make you run out of memory. The hash implementation certainly does not need to buffer the entire message before the final hash value can be calculated.
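As a rough illustration of that conclusion, the sketch below hashes 16 MiB of data while allocating only a single 64 KiB buffer in user code. The sizes and the fill byte are arbitrary assumptions for the example, and the loop merely stands in for the chunks an upload would deliver.

```javascript
const crypto = require('crypto');

const CHUNK_SIZE = 64 * 1024; // 64 KiB per simulated upload chunk
const CHUNK_COUNT = 256;      // 16 MiB in total

const hash = crypto.createHash('md5');
const chunk = Buffer.alloc(CHUNK_SIZE, 0x61); // the only data buffer allocated here

for (let i = 0; i < CHUNK_COUNT; i++) {
  // The same 64 KiB buffer is fed in repeatedly; nothing accumulates in this code.
  hash.update(chunk);
}

const streamedDigest = hash.digest('hex');
console.log(streamedDigest);
```

The digest is the same one you would get by hashing a single 16 MiB buffer with the same contents, without this code ever holding those 16 MiB at once.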
Notes:
- MD5 and SHA-1 are considered insecure w.r.t. collision resistance and they should preferably not be used anymore, especially when it comes to validating contents;
- A "compression function" is a specific cryptographic notion; it is not data compression such as ZIP or anything similar;
- There may be specialized, theoretical hashes that calculate the values differently - theoretically speaking there is no requirement to split the input messages into blocks and operate on the blocks sequentially. No worry, those are unlikely to be in the libraries you are using;
- Similarly, implementations may decide to buffer more blocks at once, but that is fortunately extremely uncommon as well. Commonly only one block is used as buffer - in some cases it could be more performant to buffer a few blocks instead;
- Some low level implementations may require you to supply the blocks yourself for reasons of efficiency.
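Regarding the first note: since the update/digest pattern is the same for the other hashes mentioned here, switching algorithms in Node is mostly a matter of changing the name passed to crypto.createHash. The helper hashChunks below is a hypothetical name used only for this sketch, and whether you can switch depends on what the receiving side expects.

```javascript
const crypto = require('crypto');

// Hypothetical helper: feeds an iterable of chunks into the named hash.
function hashChunks(algorithm, chunks) {
  const hash = crypto.createHash(algorithm);
  for (const chunk of chunks) {
    hash.update(chunk); // same incremental call regardless of algorithm
  }
  return hash.digest('hex');
}

// The message "abc", delivered as three one-byte chunks.
const chunks = [Buffer.from('a'), Buffer.from('b'), Buffer.from('c')];

console.log(hashChunks('md5', chunks));
// 900150983cd24fb0d6963f7d28e17f72
console.log(hashChunks('sha256', chunks));
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```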