I have a large text file that doesn't have any line breaks. It just contains one long String (a single huge line made up entirely of ASCII characters). So far everything works fine because I am able to read the whole line into memory in Java, but I am wondering whether there could be a memory leak issue once the file becomes very big, like 5 GB+, and the program can't read the whole file into memory at once. In that case, what would be the best way to read such a file? Can we break the huge line into 2 parts, or even multiple chunks?
Here's how I read the file:
BufferedReader buf = new BufferedReader(new FileReader("input.txt"));
String line;
while((line = buf.readLine()) != null){
}
A single String can be at most about 2 billion characters long and uses up to 2 bytes per character, so if you could read a 5 GB line it would use up to 10 GB of memory.
I suggest you read the text in blocks.
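A minimal sketch of reading in blocks, assuming a plain ASCII file; the class and method names (BlockReader, readInBlocks) are made up for illustration, and the running total stands in for whatever processing you actually need:

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

final class BlockReader {
    // Reads the file block by block and returns the number of characters seen.
    // Only one block is held in memory at any time.
    static long readInBlocks(String fileName) throws IOException {
        char[] buffer = new char[8192]; // 8192 chars * 2 bytes each = 16 KB
        long total = 0;
        try (Reader reader = new FileReader(fileName)) {
            int len;
            // read() fills at most buffer.length chars and returns -1 at end of file
            while ((len = reader.read(buffer)) != -1) {
                // Only buffer[0..len) is valid for this block; process it here.
                total += len; // placeholder for real processing
            }
        }
        return total;
    }
}
```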
This will use about 16 KB regardless of the size of the file.
To read chunks from a file, or to write those chunks to another file, this could be used:
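A rough sketch of such a chunked read and write, using plain streams. The names ChunkCopy and copyInChunks and the chunkSize parameter are assumptions for the example, not anything from the question:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ChunkCopy {
    // Copies source to target one chunk at a time and returns the bytes copied.
    static long copyInChunks(Path source, Path target, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        long copied = 0;
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            int len;
            while ((len = in.read(chunk)) != -1) {
                // Write only the bytes actually read; the last chunk is usually shorter.
                out.write(chunk, 0, len);
                copied += len;
            }
        }
        return copied;
    }
}
```

A call could look like ChunkCopy.copyInChunks(Paths.get("input.txt"), Paths.get("output.txt"), 64 * 1024). Memory use is bounded by chunkSize no matter how large the file is.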
There won't be any kind of memory leak, as the JVM has its own garbage collector. However, you will probably run out of heap space.
In cases like this, it is always best to read and process the stream in manageable pieces. Read in 64 MB or so, process it, and repeat.
You might also find it useful to add the -Xmx parameter to your java call, in order to increase the maximum heap space available within the JVM.

You won't run into any memory leak issues, but you may run into heap space issues. To avoid heap issues, use a buffer.
It all depends on how you are currently reading the line. It is possible to avoid all heap issues by using a buffer.
It's better to read the file in chunks and then concatenate the chunks or do whatever you want with them, because if you read a big file in one go you will get heap space issues.
An easy way to do it is like below:
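One way this could look, as a sketch only: each chunk is handed to a callback so the caller decides what to do with it. ChunkProcessor, forEachChunk and the handler parameter are hypothetical names. Because the file is pure ASCII (one byte per character), a chunk boundary can never split a character, so decoding each chunk on its own is safe here; that would not hold for a multi-byte encoding such as UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

final class ChunkProcessor {
    // Reads the file in pieces of at most chunkSize bytes and passes each
    // piece, decoded as ASCII text, to the handler.
    static void forEachChunk(Path file, int chunkSize, Consumer<String> handler)
            throws IOException {
        byte[] buffer = new byte[chunkSize];
        try (InputStream in = Files.newInputStream(file)) {
            int len;
            while ((len = in.read(buffer)) != -1) {
                // A read may return fewer bytes than chunkSize, so use len.
                handler.accept(new String(buffer, 0, len, StandardCharsets.US_ASCII));
            }
        }
    }
}
```

For example, ChunkProcessor.forEachChunk(Paths.get("input.txt"), 8192, chunk -> System.out.print(chunk)) would stream the whole file to standard output without ever holding more than one chunk.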
In addition to the idea of reading in chunks, you could also look at memory mapping areas of the file using java.nio.MappedByteBuffer. You would still be limited to a maximum buffer size of Integer.MAX_VALUE. This may be better than explicitly reading chunks if you will be making scattered accesses within a chunk.
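A minimal sketch of the memory-mapping approach, mapping the file one window at a time so that no single mapping exceeds Integer.MAX_VALUE bytes. The names MappedScan, countByte and windowSize are assumptions for the example; counting a byte stands in for the real work:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedScan {
    // Counts occurrences of target by mapping the file window by window.
    // windowSize must be between 1 and Integer.MAX_VALUE, the largest
    // region a single MappedByteBuffer can cover.
    static long countByte(Path file, byte target, long windowSize) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            for (long pos = 0; pos < fileSize; pos += windowSize) {
                long len = Math.min(windowSize, fileSize - pos);
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    if (window.get() == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

This version scans sequentially, but within a window you can also jump around with the absolute window.get(index) method, which is where mapping tends to pay off over explicit chunk reads.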