I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?
Most everyone is familiar with the basic code (which I'll repeat here just in case):
MessageDigest md = MessageDigest.getInstance( "SHA" );      // requires java.security.MessageDigest
FileInputStream ios = new FileInputStream( "myfile.bmp" );  // requires java.io.FileInputStream
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while( ( read = ios.read( buffer ) ) > 0 )
    md.update( buffer, 0, read );
ios.close();
md.digest();
What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it's OS, file system, and HDD dependent, and there may be other hardware/software in the mix.
(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)
Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)
Edit: The code above omits things like try..catch to keep the post short.
You could use BufferedInputStream/BufferedReader and rely on their buffer sizes.
I believe the BufferedXStreams use 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. It's really going to depend on the filesystem and disk configuration as to what the best sizes are.
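A minimal sketch of that approach, assuming SHA-1 and an illustrative file path; the second constructor argument is where you can override BufferedInputStream's 8192-byte default:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BufferedDigest {
    static byte[] digest(String path, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // The second constructor argument overrides the default 8192-byte buffer.
        try (InputStream in = new BufferedInputStream(new FileInputStream(path), bufferSize)) {
            byte[] chunk = new byte[4 * 1024];
            int read;
            while ((read = in.read(chunk)) != -1) {
                md.update(chunk, 0, read);
            }
        }
        return md.digest();
    }
}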
Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.
Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.
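For example, a sketch of the question's loop with close() moved into a finally block (same classes as the code in the question; on Java 7+ a try-with-resources block does the same thing more concisely):

MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
try {
    byte[] buffer = new byte[16 * 1024]; // 16K, per the suggestion above
    int read;
    while ( ( read = ios.read( buffer ) ) > 0 )
        md.update( buffer, 0, read );
} finally {
    ios.close(); // runs even if read() or update() throws
}
md.digest();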
As already mentioned in other answers, use BufferedInputStreams.
After that, I guess the buffer size does not really matter much. Either the program is I/O bound, and growing the buffer size beyond the BufferedInputStream default will not have any big impact on performance.
Or the program is CPU bound inside MessageDigest.update(), and the majority of the time is not spent in the application code, so tweaking the buffer size will not help.
(Hmm... with multiple cores, threads might help.)
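Since the question involves hashing 100,000+ files, here is a hedged sketch of the multi-core idea: hash independent files on a fixed thread pool, with one MessageDigest per task (the file list and the use of Files.readAllBytes are illustrative assumptions; large files would still need a streaming read):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDigest {
    public static void main(String[] args) throws Exception {
        List<Path> files = List.of(Paths.get("a.bmp"), Paths.get("b.bmp")); // illustrative list
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        Map<Path, Future<byte[]>> results = new ConcurrentHashMap<>();
        for (Path f : files) {
            // One MessageDigest per task: the class is not thread-safe.
            results.put(f, pool.submit(() -> {
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                return md.digest(Files.readAllBytes(f)); // OK for small files only
            }));
        }
        for (Map.Entry<Path, Future<byte[]>> e : results.entrySet()) {
            System.out.printf("%s -> %d-byte digest%n", e.getKey(), e.getValue().get().length);
        }
        pool.shutdown();
    }
}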
In BufferedInputStream's source you will find: private static int DEFAULT_BUFFER_SIZE = 8192;
So it's OK for you to use that default value.
But if you can figure out some more information, you will get more valuable answers.
For example, your ADSL link may prefer a buffer of 1454 bytes because of the TCP/IP payload size. For disks, you may use a value that matches your disk's block size.
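If you want to look up the block size programmatically, here is a small sketch using NIO's FileStore (the path is an assumption, and FileStore.getBlockSize() requires Java 10 or later):

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlockSizeProbe {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get("myfile.bmp"); // illustrative path
        FileStore store = Files.getFileStore(p);
        // getBlockSize() was added in Java 10 and may throw UnsupportedOperationException.
        System.out.println("Block size: " + store.getBlockSize() + " bytes");
    }
}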
Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.
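A hedged sketch of the memory-mapping variant: MessageDigest.update(ByteBuffer) consumes the mapped buffer directly. Note that a single map() call is limited to 2 GB, so truly large files would have to be mapped in chunks:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MappedDigest {
    static byte[] digestMapped(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file read-only and feed it to the digest in one call.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            md.update(buf);
        }
        return md.digest();
    }
}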
In the ideal case we would have enough memory to read the file in one read operation. That would perform best, because we let the system manage the file system, allocation units, and HDD at will. In practice, if you are fortunate enough to know the file sizes in advance, just use the average file size rounded up to 4K (the default allocation unit on NTFS). And best of all: create a benchmark to test multiple options, as in the sketch below.
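A minimal benchmark sketch along those lines, assuming an illustrative test file and a set of candidate buffer sizes (the timings are rough wall-clock numbers; for serious measurements use a proper harness such as JMH and average over many runs):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BufferSizeBenchmark {
    public static void main(String[] args) throws Exception {
        String file = "myfile.bmp"; // illustrative test file
        int[] sizes = {4 * 1024, 8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024};
        for (int size : sizes) {
            long start = System.nanoTime();
            digest(file, size);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("buffer %6d bytes -> %d ms%n", size, elapsedMs);
        }
    }

    static byte[] digest(String path, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }
}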