I have an application that needs to transfer files to a service like S3.
I have an InputStream of that incoming file (not necessarily a FileInputStream), and I write this InputStream to a multipart request body that is represented by an OutputStream, and then I need to write the hash of the file at the end (also through the request body).
Thanks to the DigestInputStream, I'm able to compute the hash live, so after the file body has been sent to the OutputStream, the hash becomes available and can also be appended to the multipart request.
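To make the setup concrete, here is a simplified sketch of that kind of copy loop. It is not my actual code: the class name DigestCopy, the method names, the buffer size, and the in-memory streams in main are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestCopy {

    /** Copies in to out and returns the digest of every byte that was read. */
    static byte[] copyAndHash(InputStream in, OutputStream out, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        // DigestInputStream updates md as a side effect of each read()
        DigestInputStream din = new DigestInputStream(in, md);
        byte[] buffer = new byte[8192]; // arbitrary buffer size
        int n;
        while ((n = din.read(buffer)) != -1) {
            // read + hash happen inside din.read(), then the write happens here:
            // all on the same thread, one after the other
            out.write(buffer, 0, n);
        }
        return md.digest();
    }

    /** Lower-case hex encoding of a digest. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the incoming file and the multipart request body
        InputStream file = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream body = new ByteArrayOutputStream();

        byte[] digest = copyAndHash(file, body, "MD5");
        // The hash is only known once the whole file has gone through,
        // so it is appended after the file bytes
        body.write(toHex(digest).getBytes(StandardCharsets.US_ASCII));
        System.out.println(toHex(digest));
    }
}
```

In the real application the OutputStream is the multipart request body rather than a ByteArrayOutputStream, and the hash goes into its own part.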
You can check this related question: What is the less expensive hash algorithm?
And particularly my own benchmark answer: https://stackoverflow.com/a/19160508/82609
So it seems my own computer is capable of hashing with a MessageDigest with a throughput of 500 MB/s for MD5, and nearly 200 MB/s for SHA-512.
The connection to which I write the request body has a throughput of 100 MB/s. If I write to the OutputStream at a higher rate, the OutputStream starts to block. This is intentional: we want to keep a low memory footprint and do not want bytes to accumulate in some part of the application.
I have run tests and can clearly see the impact of the hash algorithm on the performance of my application.
I tried uploading 20 files of 50 MB each (1 GB total).
- With MD5, it takes ~16 s
- With SHA-512, it takes ~22 s
When doing a single upload, I can also see a slowdown of the same order.
So in the end, computing the hash and writing to the connection are not parallelised. The steps are done sequentially:
- Request bytes from the stream
- Hash the requested bytes
- Send the bytes
Since the hashing throughput is higher than the connection throughput, is there an easy way to avoid that slowdown? Does it require additional threads?
I think the next chunk of data could be read and hashed while the previous chunk is being written to the connection, right?
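A minimal sketch of the kind of overlap I have in mind, assuming one extra thread is acceptable. It is a slight variation on the idea above: each chunk is hashed on a helper thread while that same chunk is being written, so a single buffer is enough and memory stays bounded. The class name OverlappedHashCopy, the method signature, and the chunkSize parameter are hypothetical, and I have not measured whether this actually recovers the lost throughput.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OverlappedHashCopy {

    /**
     * Copies in to out while hashing each chunk on a helper thread.
     * chunkSize must be greater than zero.
     */
    static byte[] copyAndHash(InputStream in, OutputStream out, String algorithm, int chunkSize)
            throws IOException, NoSuchAlgorithmException, InterruptedException, ExecutionException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        // A single worker thread keeps the md.update() calls in chunk order
        ExecutorService hasher = Executors.newSingleThreadExecutor();
        try {
            byte[] buffer = new byte[chunkSize];
            int n;
            while ((n = in.read(buffer)) != -1) {
                final int len = n;
                // Start hashing this chunk on the helper thread...
                Future<?> hashed = hasher.submit(() -> md.update(buffer, 0, len));
                // ...while the calling thread writes (and possibly blocks on) the connection
                out.write(buffer, 0, len);
                // Wait for the hash before the buffer is overwritten by the next read
                hashed.get();
            }
            // Safe to call here: every update has completed and been observed via get()
            return md.digest();
        } finally {
            hasher.shutdown();
        }
    }
}
```

With this shape, the time per chunk should be roughly the slower of the hash and the write rather than their sum, but only if the chunks are large enough for the per-chunk task hand-off to be negligible.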
This is not a premature optimization: we need to upload a lot of documents, and execution time is critical for our business.