DigestInputStream -> compute the hash without slowing down the upload

Posted 2019-09-05 05:24

Question:

I have an application that needs to transfer files to a service like S3.

I have an InputStream of that incoming file (not necessarily a FileInputStream), and I write this InputStream to a multipart request body that is represented by an OutputStream, and then I need to write the hash of the file at the end (also through the request body).

Thanks to DigestInputStream, I'm able to compute the hash on the fly: once the file body has been sent to the OutputStream, the hash becomes available and can also be appended to the multipart request.
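Here is a minimal sketch of that sequential setup (the names `incoming` and `requestBody` are hypothetical stand-ins for the file stream and the multipart request body, not my actual code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class StreamAndHash {

    static byte[] streamAndHash(InputStream incoming, OutputStream requestBody)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream din = new DigestInputStream(incoming, md)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = din.read(buf)) != -1) {
                // Every byte read through the DigestInputStream also updates the digest.
                requestBody.write(buf, 0, n);
            }
        }
        // The hash becomes available once the whole body has been streamed,
        // so it can then be appended to the multipart request.
        return md.digest();
    }
}
```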

You can check this related question: What is the less expensive hash algorithm?

And particularly my own benchmark answer: https://stackoverflow.com/a/19160508/82609

So it seems my machine can hash with a MessageDigest at a throughput of 500MB/s for MD5, and nearly 200MB/s for SHA-512.

The connection to which I write the request body has a throughput of 100MB/s. If I write to the OutputStream faster than that, it starts to block (this is intentional: we want to keep a low memory footprint and do not want bytes to accumulate anywhere in the application).


I have run tests and can clearly see the impact of the hash algorithm on my application's performance.

I tried uploading 20 files of 50MB each (1GB total).

  • With MD5, it takes ~16sec
  • With SHA-512, it takes ~22sec

When doing a single upload, I can also see a slowdown of the same order.

So in the end the hash computation and the write to the connection are not parallelised; for each chunk, these steps run sequentially:

  • Request the bytes from the stream
  • Hash the requested bytes
  • Send the bytes to the connection

So, since the hashing throughput is higher than the connection throughput, is there an easy way to avoid that slowdown? Does it require additional threads?

I think the next chunk of data could be read and hashed while the previous chunk is being written to the connection, right? (See the sketch below.)
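For example, here is a rough sketch of that idea (all names are hypothetical): a single background thread hashes each chunk while the caller's thread writes the same chunk to the connection, so the hash is computed during the time the write would block anyway:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class PipelinedStreamAndHash {

    static byte[] streamAndHash(InputStream incoming, OutputStream requestBody)
            throws IOException, NoSuchAlgorithmException,
                   InterruptedException, ExecutionException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        ExecutorService hasher = Executors.newSingleThreadExecutor();
        try {
            final byte[] buf = new byte[64 * 1024];
            int len;
            while ((len = incoming.read(buf)) != -1) {
                final int n = len;
                // Hash this chunk in the background...
                Future<?> hashDone = hasher.submit(() -> md.update(buf, 0, n));
                // ...while this thread blocks on the slower connection. Since
                // hashing (200-500MB/s) outpaces the connection (100MB/s), the
                // hash finishes before the write does and costs no extra time.
                requestBody.write(buf, 0, len);
                // Wait before reusing the buffer; Future.get() also makes the
                // digest update visible to this thread (happens-before).
                hashDone.get();
            }
            return md.digest();
        } finally {
            hasher.shutdown();
        }
    }
}
```

With a second buffer, the read of the next chunk could be overlapped as well, but since hashing is already faster than the connection, hiding the hash behind the write should be enough to recover the lost throughput.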

This is not premature optimization: we need to upload a lot of documents, and the execution time directly matters for our business.