Java: Memory efficient ByteArrayOutputStream

2020-05-23 09:59发布

I've got a 40MB file in the disk and I need to "map" it into memory using a byte array.

At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I find it takes about 160MB of heap space at some moment during the copy operation.

Does somebody know a better way to do this without using three times the file size of RAM?

Update: Thanks for your answers. I noticed I could reduce memory consumption a little telling ByteArrayOutputStream initial size to be a bit greater than the original file size (using the exact size with my code forces reallocation, got to check why).

There's another high memory spot: when I get byte[] back with ByteArrayOutputStream.toByteArray. Taking a look to its source code, I can see it is cloning the array:

public synchronized byte toByteArray()[] {
    return Arrays.copyOf(buf, count);
}

I'm thinking I could just extend ByteArrayOutputStream and rewrite this method, so to return the original array directly. Is there any potential danger here, given the stream and the byte array won't be used more than once?

9条回答
Evening l夕情丶
2楼-- · 2020-05-23 10:27

For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.

In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods such that the maximum additional allocation is limited, say, to 16MB. You should not override the toByteArray to expose the protected buf[] member. This is because a stream is not a buffer; A stream is a buffer that has a position pointer and boundary protection. So, it is dangerous to access and potentially manipulate the buffer from outside the class.

查看更多
狗以群分
3楼-- · 2020-05-23 10:28

Google Guava ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList(from Colt Library) it does not merge the data into a huge byte array but stores every chunk separately. An example:

List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) {
    byte[] cbuf = new byte[CHUNK_SIZE];
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);

The ByteSource can be read as an InputStream anytime later:

InputStream data = body.openBufferedStream();
查看更多
混吃等死
4楼-- · 2020-05-23 10:30

If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.

You can try the old read the file at once approach.

File file = 
DataInputStream is = new DataInputStream(FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();

Using a MappedByteBuffer is more efficient and avoids a copy of data (or using the heap much) provided you can use the ByteBuffer directly, however if you have to use a byte[] its unlikely to help much.

查看更多
再贱就再见
5楼-- · 2020-05-23 10:37

If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.

If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.

Guava has Files.toByteArray() which does all that for you.

查看更多
一纸荒年 Trace。
6楼-- · 2020-05-23 10:37

I'm thinking I could just extend ByteArrayOutputStream and rewrite this method, so to return the original array directly. Is there any potential danger here, given the stream and the byte array won't be used more than once?

You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:

/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}

An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:

/**
 * Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
 */
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }

            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}

(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)

However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.

查看更多
姐就是有狂的资本
7楼-- · 2020-05-23 10:38

ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?

Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.

查看更多
登录 后发表回答