Java - Read a big file (few GB) without ex

Posted 2019-09-22 10:37

This question is very short. I have a file, Datei.trec (3.99 GB), and I read it with this code:

public class Main {
    public static void main(String[] args) {
        byte[] content = null;
        try {
            content = Files.readAllBytes(Paths.get("D:", "Videos","Captures","Datei.trec"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(content);
    }
}

and this is the output:

Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
    at java.nio.file.Files.readAllBytes(Unknown Source)
    at Main.main(Main.java:13)

Is there a way to read the file into an array without an exception (streams, etc.)? The file is smaller than the allowed heap, so it should be possible to hold all the data in the program at once.

2 answers
ら.Afraid
#2 · 2019-09-22 10:41

I'd recommend streaming through the file; for example, you can use LineIterator from Apache Commons IO:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.next();  // process the line here; only one line is held in memory at a time
    }
} finally {
    LineIterator.closeQuietly(it);
}
手持菜刀,她持情操
#3 · 2019-09-22 11:06

The issue is that the array required to hold all that data is larger than MAX_BUFFER_SIZE, which is defined in java.nio.file.Files as Integer.MAX_VALUE - 8:

public static byte[] readAllBytes(Path path) throws IOException {
    try (SeekableByteChannel sbc = Files.newByteChannel(path);
         InputStream in = Channels.newInputStream(sbc)) {
        long size = sbc.size();
        if (size > (long)MAX_BUFFER_SIZE)
            throw new OutOfMemoryError("Required array size too large");

        return read(in, (int)size);
    }
}

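This check is necessary because Java arrays are indexed by int, so no array can hold more than Integer.MAX_VALUE elements. The 8 bytes of slack are there because some VMs reserve header words in an array. This is the biggest array you can get, regardless of how large the heap is.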
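To make the arithmetic concrete: a 3.99 GB file is roughly 4 billion bytes, while the limit is 2,147,483,639. The following is a minimal sketch of checking a file's size against that limit before attempting to read it whole. The class name SizeCheck, the constant MAX_ARRAY_BYTES, and the method fitsInOneArray are hypothetical names for illustration; the constant is our own copy of the value, since the one in Files is private.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SizeCheck {
    // Our own copy of the limit quoted above (the field in Files is private).
    static final long MAX_ARRAY_BYTES = Integer.MAX_VALUE - 8;

    /** Returns true if a file of the given size could fit in a single byte[]. */
    static boolean fitsInOneArray(long sizeInBytes) {
        return sizeInBytes <= MAX_ARRAY_BYTES;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SizeCheck <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        long size = Files.size(path);  // size in bytes, reported as a long
        System.out.println(size + " bytes, fits in one array: " + fitsInOneArray(size));
    }
}
```

If the check fails, no heap setting will help; one of the options below is needed.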

You have three options:

Stream through the file

That is, open the file, read a chunk, process it, read another chunk, process it, again and again until you've gone through the whole thing.

Java provides lots of classes to do this: InputStream, Reader, Scanner etc. -- they are discussed early in most introductory Java courses and books. Study one of these.

Example https://stackoverflow.com/a/21706141/7512
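A minimal sketch of the read-a-chunk, process-it loop described above, using a plain InputStream. The class name ChunkedRead, the method countBytes, and the 8 MB buffer size are assumptions for illustration; the "processing" here only counts bytes, and real code would do its work at the marked line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ChunkedRead {
    /** Streams the file through a fixed-size buffer and returns the number of bytes seen. */
    static long countBytes(Path path, int bufferSize) throws IOException {
        long total = 0;
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            // read() fills at most buffer.length bytes and returns -1 at end of file
            while ((n = in.read(buffer)) != -1) {
                // process buffer[0..n) here; this sketch only counts the bytes
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ChunkedRead <file>");
            return;
        }
        // 8 MB buffer: an arbitrary choice, not a recommendation
        System.out.println(countBytes(Paths.get(args[0]), 8 * 1024 * 1024) + " bytes read");
    }
}
```

Memory use stays at the buffer size no matter how large the file is.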

The usefulness of this depends on you being able to do something worthwhile with an early part of the file, without knowing what's coming. A lot of the time this is the case. Other times you have to make more than one pass through the file.

File formats are often designed so that processing can be done in a single pass -- it's a good idea to design your own file formats with this in mind.

I note that your file is a .trec file, which is a screen-captured video. Video and audio formats are especially likely to be designed for streaming -- which is the reason you can watch the start of a YouTube video before the end has downloaded.

Memory mapping

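If you really need to jump around the content of the file to process it, you can open it as a memory-mapped file.

Look at the documentation for FileChannel.map(), which returns a MappedByteBuffer; a single mapping is limited to Integer.MAX_VALUE bytes, so a file this size has to be mapped in several windows. Alternatively, RandomAccessFile gives you an object with a seek() method so you can read from arbitrary points in the file's data without mapping it.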
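A minimal sketch of reading from an arbitrary offset through a memory mapping, using FileChannel.map from java.nio. The class name MappedPeek and the method readWindow are hypothetical. Because one mapping cannot exceed Integer.MAX_VALUE bytes, the sketch maps only a small window starting at the requested position rather than the whole file.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class MappedPeek {
    /** Returns up to `length` bytes starting at `position`, read through a mapped window. */
    static byte[] readWindow(Path path, long position, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long remaining = channel.size() - position;
            if (position < 0 || remaining <= 0) {
                throw new IllegalArgumentException("position outside file: " + position);
            }
            // Clamp the window so it never runs past the end of the file
            int size = (int) Math.min(length, remaining);
            MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, position, size);
            byte[] out = new byte[size];
            window.get(out);  // copies the mapped bytes into the array
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java MappedPeek <file> <position>");
            return;
        }
        byte[] bytes = readWindow(Paths.get(args[0]), Long.parseLong(args[1]), 16);
        System.out.println("read " + bytes.length + " bytes");
    }
}
```

The position is a long, so offsets beyond 2 GB work even though each window is addressed with an int.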

Read to multiple arrays

I include this only for completeness; it's ugly to slurp the whole file into heap memory. But if you really wanted to, you could store the bytes in a number of arrays -- perhaps a List<byte[]>. Java-ish pseudocode:

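  List<byte[]> filecontents = new ArrayList<byte[]>();
  // MAX_BUFFER_SIZE is your own constant here (the one in Files is private)
  byte[] buffer = new byte[MAX_BUFFER_SIZE];
  try (InputStream is = new FileInputStream(...)) {
       int bytesGot = is.read(buffer);
       while (bytesGot != -1) {   // read() returns -1 at end of file
            // copy only the bytes actually read into a right-sized chunk
            byte[] chunk = new byte[bytesGot];
            System.arraycopy(buffer, 0, chunk, 0, bytesGot);
            filecontents.add(chunk);
            bytesGot = is.read(buffer);   // read the next chunk
       }
  }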

This allows you up to MAX_BUFFER_SIZE * Integer.MAX_VALUE bytes. Accessing the contents is slightly more fiddly than using a simple array, but that implementation detail can be hidden inside a class.
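A minimal sketch of such a wrapper class, exposing a long index over a list of arrays. The class name ChunkedBytes and its methods are hypothetical. It assumes every chunk except the last is exactly chunkSize bytes long; a plain read() call may return fewer bytes than requested, so the code that fills it would have to keep reading until each chunk is full.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedBytes {
    private final List<byte[]> chunks = new ArrayList<>();
    private final int chunkSize;  // every chunk except the last must be exactly this long
    private long length;          // total number of bytes stored

    ChunkedBytes(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    /** Appends a chunk; rejects it if an earlier chunk was short, since that would break indexing. */
    void add(byte[] chunk) {
        if (chunk.length > chunkSize) {
            throw new IllegalArgumentException("chunk larger than chunkSize");
        }
        if (!chunks.isEmpty() && chunks.get(chunks.size() - 1).length != chunkSize) {
            throw new IllegalStateException("only the last chunk may be short");
        }
        chunks.add(chunk);
        length += chunk.length;
    }

    long length() {
        return length;
    }

    /** Returns the byte at a long index by locating its chunk and the offset within it. */
    byte get(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        byte[] chunk = chunks.get((int) (index / chunkSize));
        return chunk[(int) (index % chunkSize)];
    }
}
```

Callers then write contents.get(position) with a long position and never see the individual arrays.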

You would, of course, need to configure the JVM with a very large maximum heap size (the -Xmx option); see "How to set the maximum memory usage for JVM?"

Don't do it.
