I've got a 40MB file on disk and I need to "map" it into memory using a byte array.
At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I found that it takes about 160MB of heap space at some point during the copy operation.
Does anyone know a better way to do this without using three times the file size in RAM?
Update: Thanks for your answers. I noticed I could reduce memory consumption a little by setting the ByteArrayOutputStream's initial size to a bit more than the original file size (using the exact size with my code forces a reallocation; I still have to check why).
There's another high-memory spot: when I get the byte[] back with ByteArrayOutputStream.toByteArray. Taking a look at its source code, I can see it clones the array:
    public synchronized byte[] toByteArray() {
        return Arrays.copyOf(buf, count);
    }
I'm thinking I could just extend ByteArrayOutputStream and override this method so it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.

In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods so that the maximum additional allocation is limited to, say, 16MB.
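For illustration, a sketch of such a capped-growth subclass, assuming a 16MB step (the class name and the cap are illustrative, not from the original answer):

    import java.io.ByteArrayOutputStream;
    import java.util.Arrays;

    // Sketch: grows the buffer by at most MAX_STEP bytes at a time, instead of
    // the superclass's default doubling. (Ignores int overflow near 2GB.)
    class CappedGrowthByteArrayOutputStream extends ByteArrayOutputStream {
        private static final int MAX_STEP = 16 * 1024 * 1024; // 16MB

        CappedGrowthByteArrayOutputStream(int initialSize) {
            super(initialSize);
        }

        private void growIfNeeded(int extra) {
            int needed = count + extra; // count and buf are protected fields
            if (needed > buf.length) {
                buf = Arrays.copyOf(buf, Math.max(needed, buf.length + MAX_STEP));
            }
        }

        @Override
        public synchronized void write(int b) {
            growIfNeeded(1);
            super.write(b);
        }

        @Override
        public synchronized void write(byte[] b, int off, int len) {
            growIfNeeded(len);
            super.write(b, off, len);
        }
    }

Note the trade-off: a fixed growth step caps the size of each reallocation but means more frequent copies than doubling does.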
You should not override toByteArray to expose the protected buf[] member. A stream is not just a buffer; a stream is a buffer with a position pointer and boundary protection, so it is dangerous to access and potentially manipulate the buffer from outside the class.

Google Guava's ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into one huge byte array but stores every chunk separately.
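An example, sketched with ByteSource.wrap and ByteSource.concat (the two hard-coded chunks stand in for whatever data you are buffering):

    import com.google.common.io.ByteSource;
    import com.google.common.io.ByteStreams;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class ByteSourceExample {
        public static void main(String[] args) throws IOException {
            // Buffer the data as independent chunks rather than one huge array.
            List<ByteSource> chunks = new ArrayList<>();
            chunks.add(ByteSource.wrap("Hello, ".getBytes(StandardCharsets.UTF_8)));
            chunks.add(ByteSource.wrap("world!".getBytes(StandardCharsets.UTF_8)));

            // A concatenated view; the chunks are never merged into one byte[].
            ByteSource source = ByteSource.concat(chunks);

            // The ByteSource can be opened as an InputStream any time later.
            try (InputStream in = source.openStream()) {
                byte[] all = ByteStreams.toByteArray(in);
                System.out.println(new String(all, StandardCharsets.UTF_8));
            }
        }
    }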
The ByteSource can be read as an InputStream any time later, as the openStream() call at the end of the sketch shows.

If you have 40 MB of data, I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream, which creates a byte[] copy when finished.
You can try the old read-the-file-at-once approach.
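For instance, a minimal sketch with DataInputStream.readFully (the file name is illustrative):

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ReadAtOnce {
        public static void main(String[] args) throws IOException {
            File file = new File("data.bin"); // illustrative path
            byte[] bytes = new byte[(int) file.length()]; // one allocation, exact size
            try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
                in.readFully(bytes); // blocks until the whole array is filled
            }
            System.out.println("Read " + bytes.length + " bytes");
        }
    }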
Using a MappedByteBuffer is more efficient and avoids a copy of the data (and avoids using much heap), provided you can use the ByteBuffer directly; however, if you have to end up with a byte[], it's unlikely to help much.

If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
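A minimal memory-mapping sketch, assuming a read-only mapping (the path is illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapFile {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r"); // illustrative path
                 FileChannel channel = raf.getChannel()) {
                // Map the whole file; no byte[] copy is made on the Java heap.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                if (buffer.hasRemaining()) {
                    System.out.println("first byte: " + buffer.get());
                }
            }
        }
    }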
If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice. Guava has Files.toByteArray(), which does all that for you.

You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method.
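For instance, a sketch of such a subclass (the class and method names are illustrative):

    import java.io.ByteArrayOutputStream;

    // Sketch: adds a new accessor instead of changing toByteArray()'s behavior.
    class ExposedByteArrayOutputStream extends ByteArrayOutputStream {

        ExposedByteArrayOutputStream(int initialSize) {
            super(initialSize);
        }

        // Returns the internal buffer without copying. Note that the array may
        // be longer than size(); only the first size() bytes are valid data.
        public synchronized byte[] toByteArrayUnsafe() {
            return buf;
        }
    }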
An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream; see the sketch below. (That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)
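A sketch of that trick (names are illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    public class WriteToHack {

        // Grabs the internal buffer of any ByteArrayOutputStream without copying.
        static byte[] grabBuffer(ByteArrayOutputStream baos) {
            final byte[][] captured = new byte[1][];
            try {
                baos.writeTo(new OutputStream() {
                    @Override
                    public void write(byte[] b, int off, int len) {
                        captured[0] = b; // writeTo() hands over the internal buf directly
                    }

                    @Override
                    public void write(int b) {
                        // never called by writeTo()
                    }
                });
            } catch (IOException e) {
                throw new AssertionError(e); // our OutputStream never throws
            }
            return captured[0];
        }

        public static void main(String[] args) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            baos.write(42);
            byte[] buf = grabBuffer(baos);
            // The live internal array may be longer than the written data.
            System.out.println("buffer: " + buf.length + " bytes, data: " + baos.size());
        }
    }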
However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is to call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
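For example (the path is illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadAllBytesExample {
        public static void main(String[] args) throws IOException {
            // Java 7+: one call, one allocation at the exact file size.
            byte[] bytes = Files.readAllBytes(Paths.get("data.bin")); // illustrative path
            System.out.println("Read " + bytes.length + " bytes");
        }
    }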
ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?

Alternatively, if you already know the size to start with, you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
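A sketch of that loop (the path is illustrative):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ReadLoop {
        public static void main(String[] args) throws IOException {
            File file = new File("data.bin"); // illustrative path
            byte[] buffer = new byte[(int) file.length()]; // size known up front
            int offset = 0;
            try (FileInputStream in = new FileInputStream(file)) {
                // read() may return fewer bytes than requested, so loop until full.
                while (offset < buffer.length) {
                    int read = in.read(buffer, offset, buffer.length - offset);
                    if (read < 0) {
                        throw new IOException("unexpected end of file");
                    }
                    offset += read;
                }
            }
            System.out.println("Read " + offset + " bytes");
        }
    }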