I am working on an application that reads large amounts of data from a file. I have a huge file (around 1.5 to 2 GB) containing different objects (roughly 5 to 10 million per file). I need to read all of them and put them into different maps in the app. The problem is that the app runs out of memory at some point while reading the objects. It can handle the file only when I run it with -Xmx4096m, and if the file gets any larger, that will no longer be enough.
Here's the code snippet:
String sampleFileName = "sample.file";
FileInputStream fileInputStream = null;
ObjectInputStream objectInputStream = null;
try {
    fileInputStream = new FileInputStream(new File(sampleFileName));
    int bufferSize = 16 * 1024;
    objectInputStream = new ObjectInputStream(new BufferedInputStream(fileInputStream, bufferSize));
    while (true) {
        try {
            Object objectToRead = objectInputStream.readUnshared();
            if (objectToRead == null) {
                break;
            }
            // doing something with the object
        } catch (EOFException eofe) {
            eofe.printStackTrace();
            break;
        } catch (Exception e) {
            e.printStackTrace();
            continue;
        }
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (objectInputStream != null) {
        try {
            objectInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    if (fileInputStream != null) {
        try {
            fileInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
}
At first I was using objectInputStream.readObject(); switching to objectInputStream.readUnshared() solved the issue only partially. When I increased the heap from 2048 MB to 4096 MB, the app was able to parse the file. BufferedInputStream is already in use. On the web I have found only examples of reading lines or bytes, but nothing about reading objects with performance in mind.
How can I read the file without increasing the JVM memory and without getting an OutOfMemoryError? Is there a way to read objects from the file without keeping anything else in memory?
When reading big files, parsing objects, and keeping them in memory, there are several solutions with different tradeoffs:
If all parsed objects can fit into memory on one server, you either have to store them very compactly (for example, packing two numbers into a single byte or integer with bit shifting, so that each object takes the minimum possible space) or increase the memory of that server (scale vertically).
a) However, reading the files can itself take too much memory, so you have to read them in chunks. I did this with JSON files: the idea is to have a way to identify where a given object starts and ends, and to read only that part.
b) You can also split the files into smaller ones at the source, if you can; they will then be easier to read.
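Option a) above can be sketched in Java roughly as follows. This is only an illustration under assumptions that are not part of the answer itself: records are assumed to be written with a 4-byte length prefix so the reader knows where each one starts and ends, and the names RecordReader, writeRecords and readRecords are hypothetical. It is not the JSON code mentioned above, and it only saves memory if the handler does not hold on to every record it receives.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: length-prefixed framing is an assumption of this sketch.
class RecordReader {

    // Writes each record as a 4-byte length followed by its UTF-8 bytes.
    static void writeRecords(OutputStream out, String... records) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            data.writeInt(bytes.length);
            data.write(bytes);
        }
        data.flush();
    }

    // Reads one record at a time, so only the current record is held by the reader.
    // Returns the number of records passed to the handler.
    static long readRecords(InputStream in, Consumer<String> handler) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in, 16 * 1024));
        long count = 0;
        while (true) {
            int length;
            try {
                length = data.readInt();
            } catch (EOFException endOfFile) {
                // No further length prefix: treated as the end of the input.
                return count;
            }
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException if the record is cut short
            handler.accept(new String(payload, StandardCharsets.UTF_8));
            count++;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeRecords(buffer, "first", "second", "third");
        long count = readRecords(new ByteArrayInputStream(buffer.toByteArray()),
                record -> System.out.println("read: " + record));
        System.out.println("records: " + count);
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream. The same pattern applies to any format where the boundaries of a record can be found without parsing the whole file.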
If you can't fit all parsed objects for the app on one server, you have to shard based on some object property, for example by splitting the data across multiple servers by US state.
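A minimal sketch of routing by a property, assuming the shard is chosen by hashing a string key such as a state code. The ShardRouter class and its methods are hypothetical names used only for illustration, and the in-memory lists stand in for what would be separate servers in a real deployment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: hash-based routing is one possible sharding scheme.
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Maps a key to a shard index in [0, shardCount).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // Groups keys by the shard they are routed to.
    Map<Integer, List<String>> partition(Collection<String> keys) {
        Map<Integer, List<String>> shards = new TreeMap<>();
        for (String key : keys) {
            shards.computeIfAbsent(shardFor(key), index -> new ArrayList<>()).add(key);
        }
        return shards;
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        List<String> states = Arrays.asList("CA", "NY", "TX", "CA", "WA");
        System.out.println(router.partition(states));
    }
}
```

Because the same key always maps to the same shard, all objects for one state end up together. A plain hash like this does not balance load if some states hold far more data than others, so an explicit state-to-server table may suit that case better.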
Hopefully this helps with your solution.