Java: Memory efficient ByteArrayOutputStream

Published 2020-05-23 10:37

啃猪蹄的小仙女

Question:

I have a 40MB file on disk and I need to "map" it into memory as a byte array.

At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I find it takes about 160MB of heap space at some moment during the copy operation.

Does somebody know a better way to do this without using three times the file size of RAM?

Update: Thanks for your answers. I noticed I could reduce memory consumption a little by setting the ByteArrayOutputStream initial size slightly larger than the original file size (using the exact size forces a reallocation with my code; I still have to check why).

There's another memory hot spot: when I get the byte[] back from ByteArrayOutputStream.toByteArray. Looking at its source code, I can see it copies the array:

public synchronized byte toByteArray()[] {
    return Arrays.copyOf(buf, count);
}

I'm thinking I could just extend ByteArrayOutputStream and override this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

Answer 1:

MappedByteBuffer might be what you're looking for.

I'm surprised it takes so much RAM to read a file into memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB and a new buffer of twice that size. If the stream has the appropriate capacity, on the other hand, there is no reallocation (which is faster) and no wasted memory.

Answer 2:

ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?

Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
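The read-until-full loop described above could look roughly like the sketch below. The class and method names (ReadIntoArray, readFully) are made up for illustration, and the sketch assumes the file does not change size while it is being read.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadIntoArray {
    /** Reads the whole file into one array allocated at the file's reported length. */
    static byte[] readFully(File file) throws IOException {
        long size = file.length();
        if (size > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a single byte[]: " + size);
        }
        byte[] data = new byte[(int) size];
        try (InputStream in = new FileInputStream(file)) {
            int pos = 0;
            // read() may return fewer bytes than requested, so loop until the array is full.
            while (pos < data.length) {
                int n = in.read(data, pos, data.length - pos);
                if (n < 0) {
                    throw new IOException("Unexpected end of file after " + pos + " bytes");
                }
                pos += n;
            }
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway temp file so the sketch runs without arguments.
        File tmp = File.createTempFile("read-into-array", ".bin");
        tmp.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(new byte[123_456]);
        }
        System.out.println("read " + readFully(tmp).length + " bytes");
    }
}
```

Only one array of the file's size is ever allocated here, so there is no temporary second copy.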

Answer 3:

If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
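As a minimal sketch of that mechanism, FileChannel.map can return a read-only MappedByteBuffer over the whole file. The class and method names (MapFile, mapReadOnly) are illustrative only, and the sketch assumes the file is smaller than 2GB, since a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MapFile {
    /** Maps the whole file read-only; the mapped contents live outside the Java heap. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel has been closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temp file so the sketch runs without arguments.
        Path tmp = Files.createTempFile("mapfile", ".bin");
        Files.write(tmp, new byte[] {10, 20, 30});
        MappedByteBuffer buf = mapReadOnly(tmp);
        System.out.println("mapped " + buf.remaining() + " bytes, first = " + buf.get(0));
    }
}
```

The result is a ByteBuffer, not a byte[]; code that insists on a plain array would still have to copy out of it.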

If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.

Guava has Files.toByteArray() which does all that for you.

Answer 4:

For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.

In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods so that the maximum additional allocation is limited to, say, 16MB. You should not override toByteArray to expose the protected buf[] member. This is because a stream is more than a buffer: it is a buffer with a position pointer and boundary protection, so it is dangerous to access and potentially manipulate the buffer from outside the class.
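One possible reading of the "limit the additional allocation" idea is sketched below. It is not taken from the answer: the class name CappedGrowthOutputStream, the maxGrowth parameter, and the exact growth policy (double the buffer, but never add more than maxGrowth bytes unless a single write needs more) are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical subclass: grows the buffer by at most maxGrowth bytes per reallocation. */
final class CappedGrowthOutputStream extends ByteArrayOutputStream {
    private final int maxGrowth;

    CappedGrowthOutputStream(int initialSize, int maxGrowth) {
        super(initialSize);
        if (maxGrowth <= 0) {
            throw new IllegalArgumentException("maxGrowth must be positive");
        }
        this.maxGrowth = maxGrowth;
    }

    /** Enlarges buf before the superclass sees the write, so it never needs to double it. */
    private void ensureRoom(int extra) {
        long needed = (long) count + extra;
        if (needed <= buf.length) {
            return;
        }
        if (needed > Integer.MAX_VALUE - 8) {
            throw new OutOfMemoryError("Required buffer size exceeds array limit");
        }
        long doubled = 2L * buf.length;
        long capped = (long) buf.length + maxGrowth;
        // Grow like the default policy, but cap the step; always satisfy the current write.
        long target = Math.max(needed, Math.min(doubled, capped));
        buf = Arrays.copyOf(buf, (int) Math.min(target, Integer.MAX_VALUE - 8));
    }

    @Override
    public synchronized void write(int b) {
        ensureRoom(1);
        super.write(b);
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        ensureRoom(len);
        super.write(b, off, len);
    }

    /** Current length of the internal buffer, which may exceed size(). */
    synchronized int capacity() {
        return buf.length;
    }

    public static void main(String[] args) {
        CappedGrowthOutputStream out = new CappedGrowthOutputStream(1024, 16 * 1024);
        for (int i = 0; i < 100_000; i++) {
            out.write(i);
        }
        System.out.println("size = " + out.size() + ", capacity = " + out.capacity());
    }
}
```

The trade-off is more frequent reallocations (and copies) in exchange for a bounded amount of unused buffer space.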

Answer 5:

If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.

You can try the old read-the-file-at-once approach.

File file = 
DataInputStream is = new DataInputStream(new FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();

Using a MappedByteBuffer is more efficient and avoids copying the data (or using much heap), provided you can use the ByteBuffer directly. However, if you have to use a byte[], it's unlikely to help much.

Answer 6:

... but I find it takes about 160MB of heap space at some moment during the copy operation

I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.

Let's assume that your code is something like this:

BufferedInputStream bis = new BufferedInputStream(
        new FileInputStream("somefile"));
ByteArrayOutputStream baos = new ByteArrayOutputStream();  /* no hint !! */

int b;
while ((b = bis.read()) != -1) {
    baos.write((byte) b);
}
byte[] stuff = baos.toByteArray();

Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size, and (at least) double the buffer when it fills up. Thus, in the worst case baos might use a buffer of up to 80MB to hold a 40MB file.

The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40MB. So the peak amount of memory that is actually in use should be 120MB.
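The buffer growth described above can be observed with a small probe. This is only a sketch: GrowthDemo, Probe and capacityAfter are made-up names, the probe reads the protected buf field purely for inspection, and the demo is scaled down to 4MB so it runs comfortably in a default heap.

```java
import java.io.ByteArrayOutputStream;

final class GrowthDemo {
    /** Exposes the length of the protected buf field, for inspection only. */
    static final class Probe extends ByteArrayOutputStream {
        Probe() { super(); }
        Probe(int size) { super(size); }
        synchronized int capacity() { return buf.length; }
    }

    /** Writes total bytes in chunks of at most chunk bytes and returns the final capacity. */
    static int capacityAfter(Probe out, int total, int chunk) {
        byte[] block = new byte[chunk];
        for (int written = 0; written < total; written += chunk) {
            out.write(block, 0, Math.min(chunk, total - written));
        }
        return out.capacity();
    }

    public static void main(String[] args) {
        int total = 4 * 1024 * 1024; // scaled-down stand-in for the 40MB file
        System.out.println("no size hint:   capacity = " + capacityAfter(new Probe(), total, 8192));
        System.out.println("with size hint: capacity = " + capacityAfter(new Probe(total), total, 8192));
        System.out.println("bytes written:  " + total);
    }
}
```

Whatever the capacity ends up being, toByteArray() then allocates a further array of size() bytes on top of it.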

So where are those extra 40MB being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.
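The difference between the two numbers can be seen with the Runtime API. The sketch below uses made-up names (HeapUsage, totalHeap, usedHeap); note that totalMemory() minus freeMemory() is only a rough estimate, because it also counts garbage that has not been collected yet.

```java
final class HeapUsage {
    /** Heap currently reserved by the JVM, including space that is free. */
    static long totalHeap() {
        return Runtime.getRuntime().totalMemory();
    }

    /** Rough estimate of heap occupied by objects, live or not yet collected. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.gc(); // only a hint; the JVM is free to ignore it
        long before = usedHeap();
        byte[] data = new byte[8 * 1024 * 1024];
        long after = usedHeap();
        System.out.println("total heap:      " + totalHeap());
        System.out.println("used before:     " + before);
        System.out.println("used after:      " + after);
        System.out.println("array length:    " + data.length); // keeps data reachable
    }
}
```

A monitoring tool that shows the first figure will report heap that the JVM has reserved but is not necessarily using.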

So what is the solution?

You could use a memory mapped buffer.
You could give a size hint when you allocate the ByteArrayOutputStream; e.g.
```
 ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
```

You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.

 byte[] buffer = new byte[(int) file.length()];
 FileInputStream fis = new FileInputStream(file);
 int nosRead = fis.read(buffer);
 /* check that nosRead == buffer.length and repeat if necessary */

Both options 1 and 2 should have a peak memory usage of 40MB while reading a 40MB file; i.e. no wasted space.

It would be helpful if you posted your code, and described your methodology for measuring memory usage.

I'm thinking I could just extend ByteArrayOutputStream and override this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...

Answer 7:

Google Guava's ByteSource seems to be a good choice for buffering in memory. Unlike implementations such as ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into one huge byte array but stores every chunk separately. An example:

List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) {
    byte[] cbuf = new byte[CHUNK_SIZE];
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);

The ByteSource can be read as an InputStream anytime later:

InputStream data = body.openBufferedStream();
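For readers who cannot add Guava, a rough JDK-only analogue of the same chunk-per-read idea is sketched below. This swaps in a different technique (a list of byte[] chunks read back through SequenceInputStream) and is not part of Guava; ChunkedBuffer, readAll and openStream are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ChunkedBuffer {
    private final List<byte[]> chunks = new ArrayList<>();

    /** Drains the stream into separately allocated chunks of at most chunkSize bytes. */
    static ChunkedBuffer readAll(InputStream source, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ChunkedBuffer result = new ChunkedBuffer();
        byte[] scratch = new byte[chunkSize];
        int n;
        while ((n = source.read(scratch)) != -1) {
            if (n > 0) {
                // Copy only the bytes actually read; scratch is reused for the next read.
                result.chunks.add(Arrays.copyOf(scratch, n));
            }
        }
        return result;
    }

    /** Opens a fresh stream over the stored chunks without merging them into one array. */
    InputStream openStream() {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] chunk : chunks) {
            parts.add(new ByteArrayInputStream(chunk));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws IOException {
        byte[] input = new byte[50_000];
        ChunkedBuffer buffered = readAll(new ByteArrayInputStream(input), 4096);
        System.out.println("stored " + buffered.chunks.size() + " chunks");
    }
}
```

Like the ByteSource version, this never needs one contiguous array for the whole payload, at the cost of only offering stream access afterwards.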

Answer 8:

I'm thinking I could just extend ByteArrayOutputStream and override this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:

/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}
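One thing to keep in mind when using such an accessor: the internal array is usually longer than the data written, so callers need size() as well. The sketch below repeats the subclass idea under a different, made-up name (RawBufferStream) so that it compiles on its own, and wraps only the valid prefix in a ByteBuffer; the view helper is an illustration, not part of the answer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class RawBufferDemo {
    /** Same idea as ByteArrayOutputStream2 above, repeated so this sketch is self-contained. */
    static final class RawBufferStream extends ByteArrayOutputStream {
        RawBufferStream(int size) { super(size); }
        synchronized byte[] buf() { return buf; }
    }

    /** Wraps only the first size() bytes of the internal buffer, without copying. */
    static ByteBuffer view(RawBufferStream out) {
        return ByteBuffer.wrap(out.buf(), 0, out.size());
    }

    public static void main(String[] args) {
        RawBufferStream out = new RawBufferStream(16);
        out.write(new byte[] {1, 2, 3}, 0, 3);
        System.out.println("raw length = " + out.buf().length + ", valid bytes = " + view(out).remaining());
    }
}
```

Any later write to the stream may replace the internal array, so a view like this should only be taken once writing has finished.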

An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:

/**
 * Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
 */
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }

            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}

(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)

However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is to call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
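A minimal usage sketch of Files.readAllBytes follows; the wrapper class and method names (ReadAllBytesDemo, load) are illustrative only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ReadAllBytesDemo {
    /** Loads the complete file contents into a byte[] using the Java 7+ NIO API. */
    static byte[] load(String fileName) throws IOException {
        return Files.readAllBytes(Paths.get(fileName));
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temp file so the sketch runs without arguments.
        Path tmp = Files.createTempFile("read-all-bytes", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
        System.out.println("read " + load(tmp.toString()).length + " bytes");
        Files.delete(tmp);
    }
}
```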

Answer 9:

I came here with the same observation when reading a 1GB file: Oracle's ByteArrayOutputStream has rather lazy memory management. A byte array is indexed by an int, so it is limited to 2GB anyway. If you want to avoid a third-party dependency, you might find this useful:

static public byte[] getBinFileContent(String aFile) 
{
    try
    {
        final int bufLen = 32768;
        final long fs = new File(aFile).length();
        final long maxInt = ((long) 1 << 31) - 1;
        if (fs > maxInt)
        {
            System.err.println("file size out of range");
            return null;
        }
        final byte[] res = new byte[(int) fs];
        final byte[] buffer = new byte[bufLen];
        int pos = 0;
        // try-with-resources closes the stream even if a read fails
        try (InputStream is = new FileInputStream(aFile))
        {
            int n;
            while ((n = is.read(buffer)) > 0)
            {
                System.arraycopy(buffer, 0, res, pos, n);
                pos += n;
            }
        }
        return res;
    }
    catch (final IOException e)
    {
        e.printStackTrace();
        return null;
    }
    catch (final OutOfMemoryError e)
    {
        e.printStackTrace();
        return null;
    }
}