Java OutOfMemoryError in reading a large text file

Posted 2019-01-15 12:34

叼着烟拽天下

Question:

I'm new to Java and am working on reading very large files. I need some help understanding the problem and solving it. We have some legacy code that has to be optimized to run properly. The file size can vary from 10 MB to 10 GB. The trouble only starts when the file size goes beyond 800 MB.

InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.
byte[] localbuffer = new byte[2048];
ByteArrayOutputStream bArrStream = new ByteArrayOutputStream();

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    bArrStream.write(localbuffer, 0, i);
}

byte[] data = bArrStream.toByteArray();
inFileReader.close();
bArrStream.close();

We are getting this error:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)

Any help would be appreciated.

Answer 1:

Try using java.nio.MappedByteBuffer.

http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html

You can map a file's content into memory without copying it manually. Operating systems offer memory mapping, and Java has an API to use that feature.

If my understanding is correct, memory mapping does not load a file's entire content into memory (it is loaded and unloaded partially as necessary), so I guess a 10 GB file won't eat up your memory.
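As a rough sketch of what this could look like: the code below maps a local file one window at a time and counts newline bytes. It assumes the file already exists on local disk (the stream returned by channelSFtp.get(path) cannot be mapped directly), and the class name, method name and windowSize parameter are made up for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original code.
final class MappedNewlineCounter {

    // Counts '\n' bytes by mapping the file window by window.
    // A single mapping cannot exceed Integer.MAX_VALUE bytes, hence the int window size.
    static long countNewlines(Path file, int windowSize) throws IOException {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                long len = Math.min(windowSize, size - pos);
                // Only this window is mapped; the OS pages it in on demand.
                MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

The mapped pages live outside the Java heap, so the -Xmx limit does not bound them, but the address space and the OS page cache still do.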

Answer 2:

Although you can increase the JVM memory limit, doing so is unnecessary; allocating as much as 10 GB of memory just to process a file is overkill and resource intensive.

Currently you are using a "ByteArrayOutputStream", which keeps an internal buffer to hold the data. This line in your code keeps appending the most recently read 2 KB chunk to the end of that buffer:

bArrStream.write(localbuffer, 0, i);

bArrStream keeps growing and eventually you run out of memory.

Instead you should reorganize your algorithm and process the file in a streaming way:

InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.
byte[] localbuffer = new byte[2048];

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    // Deal with the chunk just read here: bytes localbuffer[0..i-1]
}

inFileReader.close();
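One possible way to "deal with" each chunk is to write it straight to another stream, for example a local file, so that only one small buffer is ever held in memory. The following is a minimal sketch under that assumption; StreamCopy and copy are hypothetical names, and the 8 KB buffer size is arbitrary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of the original code.
final class StreamCopy {

    // Copies everything from in to out through a fixed-size buffer
    // and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // only the n bytes just read are valid
            total += n;
        }
        return total;
    }
}
```

With the question's code, in would be the stream returned by channelSFtp.get(path) and out could be a FileOutputStream for a temporary file; memory use stays at the buffer size regardless of how large the file is.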

Answer 3:

The Java virtual machine (JVM) runs with a fixed upper memory limit, which you can modify thus:

java -Xmx1024m ....

e.g. the above option (-Xmx...) sets the limit to 1024 megabytes. You can amend as necessary (within limits of your machine, OS etc.) Note that this is different from traditional applications which would allocate more and more memory from the OS upon demand.

However a better solution is to rework your application such that you don't need to load the whole file into memory at one go. That way you don't have to tune your JVM, and you don't impose a huge memory footprint.
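To check which limit is actually in effect, the JVM can report it at runtime. This is a small sketch; the class name HeapLimit is made up for illustration.

```java
// Hypothetical helper that reports the heap ceiling of the running JVM.
final class HeapLimit {

    // Runtime.maxMemory() returns the maximum amount of memory the JVM will
    // attempt to use, in bytes (Long.MAX_VALUE if there is no inherent limit).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it once without options and once with -Xmx1024m shows whether the option is being picked up by the launcher that starts your application.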

Answer 4:

Run Java with the command-line option -Xmx, which sets the maximum size of the heap.

See here for details.

Answer 5:

You can't read a 10 GB text file into memory. You have to read X MB first, do something with it, and then read the next X MB.
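As an illustration of "read X MB, do something, read the next X MB", the sketch below computes a CRC32 checksum over a stream while holding only one chunk in memory at a time. The checksum is just a stand-in for whatever processing you need; ChunkedChecksum and chunkSize are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper, not part of the original code.
final class ChunkedChecksum {

    // Feeds the stream to a CRC32 chunk by chunk; the chunk array is reused,
    // so memory use does not depend on the length of the stream.
    static long crc32(InputStream in, int chunkSize) throws IOException {
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = in.read(chunk)) != -1) {
            crc.update(chunk, 0, n);
        }
        return crc.getValue();
    }
}
```

The same loop shape works for any processing that can carry its state from one chunk to the next, such as counting, hashing, or parsing records.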

Answer 6:

Try using a larger read buffer, maybe 10 MB, and then check.

Answer 7:

The problem is inherent in what you're doing. Reading entire files into memory is always and everywhere a bad idea. You're really not going to be able to read a 10GB file into memory with current technology unless you have some pretty startling hardware. Find a way to process them line by line, record by record, chunk by chunk, ...

Answer 8:

Is it mandatory to get the entire byte array from the output stream?

byte[] data = bArrStream.toByteArray();

The best approach is to read line by line and write line by line. You can use BufferedReader or Scanner to read large files, as below.

import java.io.*;
import java.util.*;

public class FileReadExample {
  public static void main(String args[]) throws FileNotFoundException {
    File fileObj = new File(args[0]);

    long t1 = System.currentTimeMillis();
    try (
        // BufferedReader object for reading the file (closed automatically)
        BufferedReader br = new BufferedReader(new FileReader(fileObj))) {
        // Reading each line of file using BufferedReader class
        String str;
        while ( (str = br.readLine()) != null) {
            System.out.println(str);
        }
    }catch(Exception err){
        err.printStackTrace();
    }
    long t2 = System.currentTimeMillis();
    System.out.println("Time taken for BufferedReader:"+(t2-t1));

    t1 = System.currentTimeMillis();
    try (
        // Scanner object for reading the file
        Scanner scnr = new Scanner(fileObj);) {
        // Reading each line of file using Scanner class
        while (scnr.hasNextLine()) {
            String strLine = scnr.nextLine();
            // print data on console
            System.out.println(strLine);
        }
    }
    t2 = System.currentTimeMillis();
    System.out.println("Time taken for scanner:"+(t2-t1));

  }
}

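You can replace System.out with another output stream (a file stream, for example) in the above example; writing everything into the ByteArrayOutputStream would still hold the whole file in memory.

For more details, have a look at this article: Read Large File

Also have a look at this related SE question:

Scanner vs. BufferedReader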

Answer 9:

ByteArrayOutputStream writes to an in-memory buffer. If this is really how you want it to work, then you have to size the JVM heap according to the maximum possible size of the input. Also, if possible, check the input size before you even start processing, to save time and resources.

The alternative approach is a streaming solution, where the amount of memory used at runtime is known (maybe configurable, but still known before the program starts). Whether it is feasible depends entirely on your application's domain (because you can no longer use an in-memory buffer) and maybe on the architecture of the rest of your code, if you can't or don't want to change it.
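A size check up front could look roughly like the sketch below. HeapBudget and fitsInMemory are hypothetical names, and the "twice the input size" budget is only a heuristic: it assumes the ByteArrayOutputStream buffer and the copy made by toByteArray() are both alive at the same moment. If the SFTP client is JSch (an assumption), the remote size could come from something like channelSFtp.lstat(path).getSize().

```java
// Hypothetical helper, not part of the original code.
final class HeapBudget {

    // A byte[] cannot be longer than an int allows; staying a few bytes below
    // Integer.MAX_VALUE is a conservative assumption about VM array limits.
    static final long MAX_ARRAY_BYTES = Integer.MAX_VALUE - 8;

    // Returns true if an input of inputBytes can plausibly be buffered whole,
    // given a heap ceiling of maxHeapBytes. Heuristic: budget 2x the input.
    static boolean fitsInMemory(long inputBytes, long maxHeapBytes) {
        return inputBytes <= MAX_ARRAY_BYTES && inputBytes * 2 <= maxHeapBytes;
    }

    // Convenience overload that uses the running JVM's own heap ceiling.
    static boolean fitsInMemory(long inputBytes) {
        return fitsInMemory(inputBytes, Runtime.getRuntime().maxMemory());
    }
}
```

A caller could fall back to a streaming code path, or fail fast with a clear message, whenever fitsInMemory returns false.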

Answer 10:

Hi, I am assuming that you are reading a large text file in which the data is laid out line by line, so use a line-by-line reading approach. As far as I know, you can read up to 6 GB this way, maybe more. I strongly advise you to try this approach.

DATA1 DATA2 ...

// Open the file
 FileInputStream fstream = new FileInputStream("textfile.txt");
 BufferedReader br = new BufferedReader(new InputStreamReader(fstream));

  String strLine;

 //Read File Line By Line
 while ((strLine = br.readLine()) != null)   {
  // Print the content on the console
  System.out.println (strLine);
 }

 //Close the input stream
 br.close();

Reference for the code fragment

Answer 11:

Read the file iteratively linewise. This would significantly reduce memory consumption. Alternately you may use

FileUtils.lineIterator(theFile, "UTF-8");

provided by Apache Commons IO.

FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}

Answer 12:

You should increase the heap size as stated in the following answer:

Increase heap size in Java

But remember that the Java runtime and your code take some space as well, so add some headroom to the desired maximum.

Answer 13:

Short answer,

without doing anything, you can push the current limit by a factor of 1.5. It means that, if you are able to process 800MB, you can process 1200 MB. It also means that if by some trick with java -Xm .... you can move to a point where your current code can process 7GB, your problem is solved, because the 1.5 factor will take you to 10.5GB, assuming you have that space available on your system and that JVM can get it.

Long answer:

The error is pretty self-descriptive. You hit the practical memory limit on your configuration. There is a lot of speculating about the limit that you can have with JVM, I do not know enough about that, since I can not find any official information. However, you will somehow be limited by constraints like the available swap, the kernel address space usage, the memory fragmentation, etc.

What is happening now is that ByteArrayOutputStream objects are created with a default buffer of size 32 if you do not supply a size (this is your case). Whenever you call the write method on the object, some internal machinery starts up. The OpenJDK implementation, release 7u40-b43, which seems to match the output of your error perfectly, uses an internal method ensureCapacity to check that the buffer has enough room for the bytes you want to write. If there is not enough room, another internal method, grow, is called to enlarge the buffer. The method grow determines the appropriate size and calls the method copyOf from the class Arrays to do the job. The appropriate size of the buffer is the larger of twice the current size and the size required to hold all the content (the present content plus the new content to be written). The method copyOf from the class Arrays allocates the space for the new buffer, copies the content of the old buffer into the new one, and returns it to grow.

Your problem occurs at the allocation of the space for the new buffer. After some writes, you get to a point where the available memory is exhausted: java.lang.OutOfMemoryError: Java heap space.

If we look into the details, you are reading in chunks of 2048 bytes. So

your first write grows the buffer from 32 to 2048
your second write doubles it to 2*2048
your third write takes it to 2^2*2048, and you have time for one more write before the next allocation
then 2^3*2048, and you have time for three more writes before allocating again
at some point, your buffer will be of size 2^18*2048, which is 2^19*1024 or 2^9*2^20 (512 MB)
then 2^19*2048, which is 1024 MB or 1 GB
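The growth sequence above can be reproduced with a small simulation. This is a simplified model of the doubling rule, not the JDK source: it only tracks sizes, uses long arithmetic, and never allocates anything. GrowthSimulation and finalCapacity are hypothetical names.

```java
// Hypothetical model of how the buffer capacity evolves; it allocates nothing.
final class GrowthSimulation {

    // Simulates writing totalBytes in chunks of chunkSize into a buffer that
    // starts at 32 bytes and grows to max(2 * capacity, required size).
    // Returns the capacity the buffer ends up with.
    static long finalCapacity(long totalBytes, int chunkSize) {
        long capacity = 32; // default initial capacity
        long count = 0;     // bytes written so far
        while (count < totalBytes) {
            long n = Math.min(chunkSize, totalBytes - count);
            long required = count + n;
            if (required > capacity) {
                capacity = Math.max(capacity * 2, required);
            }
            count = required;
        }
        return capacity;
    }
}
```

Under this model, finalCapacity(800L * 1024 * 1024, 2048) is 1073741824: storing 800 MB requires a 1 GB buffer, and that buffer is allocated while the 512 MB one is still alive, so roughly 1.5 GB of heap is needed at that moment.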

Something that is unclear in your description is that you can somehow read up to 800 MB but cannot go beyond. You would have to explain that to me.

I expect your limit to be exactly a power of 2 (or close to it, if power-of-10 units are used somewhere). In that regard, I expect you to start having trouble immediately above one of these: 256 MB, 512 MB, 1 GB, 2 GB, etc.

When you hit that limit, it does not mean that you are out of memory, it simply means that it is not possible to allocate another buffer of twice the size of the buffer you already have. This observation opens room for improvement in your work: find the maximum size of buffer that you can allocate and reserve it upfront by calling the appropriate constructor

ByteArrayOutputStream bArrStream = new ByteArrayOutputStream(myMaxSize);

It has the advantage of reducing the overhead background memory allocation that happens under the hood to keep you happy. By doing this, you will be able to go to 1.5 the limit you have right now. This is simply because the last time the buffer was increased, it went from half the current size to the current size, and at some point you had both the current buffer and the old one together in memory. But you will not be able to go beyond 3 times the limit you are having now. The explanation is exactly the same.

That being said, I do not have any magic suggestion to solve the problem apart from processing your data in chunks of a given size, one chunk at a time. Another good approach would be to follow the suggestion of Takahiko Kawasaki and use MappedByteBuffer. Keep in mind that in any case you will need at least 10 GB of physical memory or swap to be able to load a 10 GB file.

Answer 14:

After thinking about it, I decided to put a second answer. I considered the advantages and disadvantages of putting this second answer, and the advantages are worth going for it. So here it is.

Most of the suggestions overlook one fact: there is a built-in limit on the size of arrays (and therefore of ByteArrayOutputStream) in Java. That limit is dictated by the biggest int value, 2^31 - 1 (a little less than 2 GB). This means you can read at most 2 GB (minus 1 byte) into a single ByteArrayOutputStream. The actual array size limit might be smaller if the VM wants more control.

My suggestion is to use an ArrayList of byte[] instead of a single byte[] holding the full content of the file, and to remove the unnecessary step of writing to a ByteArrayOutputStream before copying into a final data array. Here is an example based on your original code:

InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.

// good habits are good, define a buffer size
final int BUF_SIZE = (int)(Math.pow(2,30)); //1GB, let's not go close to the limit

// list of chunks that replaces the single byte[]
// (needs java.util.ArrayList, java.util.Arrays and java.util.List)
List<byte[]> data = new ArrayList<>();
byte[] localbuffer = new byte[BUF_SIZE];

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    if (i < BUF_SIZE) {
        data.add(Arrays.copyOf(localbuffer, i));
        // No need to reallocate the reading buffer, we copied the data
    } else {
        data.add(localbuffer);
        // reallocate the reading buffer
        localbuffer = new byte[BUF_SIZE];
    }
}

inFileReader.close();
// Process your data, keep in mind that you have a list of buffers.
// So you need to loop over the list

Simply running your program should work fine on a 64-bit system with enough physical memory or swap. If you want to speed it up by helping the VM size the heap correctly from the start, run with the options -Xms and -Xmx. For example, if you want a 12 GB heap to be able to handle a 10 GB file, use java -Xms12288m -Xmx12288m YourApp