I've been having some issues parsing .xlsx files with Apache POI - I am getting java.lang.OutOfMemoryError: Java heap space in my deployed app. I'm only processing files under 5MB with around 70,000 rows, so my suspicion from reading a number of other questions is that something is amiss.
As suggested in this comment, I decided to run SSPerformanceTest.java with the suggested variables to see if there is anything wrong with my code or setup. The results show a significant difference between HSSF (.xls) and XSSF (.xlsx):
1) HSSF 50000 50 1: Elapsed 1 seconds
2) SXSSF 50000 50 1: Elapsed 5 seconds
3) XSSF 50000 50 1: Elapsed 15 seconds
The FAQ specifically says:
If you can't run that with 50,000 rows and 50 columns in all of HSSF, XSSF and SXSSF in under 3 seconds (ideally a lot less!), the problem is with your environment.
Next, it says to run XLS2CSV.java, which I have done. Feeding in the XSSF file generated above (with 50,000 rows and 50 columns) takes around 15 seconds - the same amount of time it took to write the file.
Is something wrong with my environment, and if so how do I investigate further?
Stats from VisualVM show the heap used shooting up to 1.2GB during processing. Surely this is way too high, considering that's an extra gigabyte on top of the heap compared to before processing began?
Note: The heap space exception mentioned above only happens in production (on Google App Engine) and only for .xlsx files; however, the tests mentioned in this question have all been run on my development machine with -Xmx2g. I'm hoping that if I can fix the problem on my development setup, it will use less memory when I deploy.
Stack trace from app engine:
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997)
    at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3211)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
The average XLSX file I work with has about 18-22 sheets of 750,000 rows with 13-20 columns. This runs inside a Spring web application alongside lots of other functionality, and I gave the whole application not that much memory: -Xms1024m -Xmx4096m - and it works great!

First of all, the dumping code: it is wrong to load each and every data row into memory and then start dumping it. In my case (reporting from a PostgreSQL database) I reworked the data dump procedure to use RowCallbackHandler to write to my XLSX; once I reach "my limit" of 750,000 rows, I create a new sheet. The workbook is created with a visibility window of 50 rows. This way I am able to dump huge volumes: the resulting XLSX file is about 1230MB.

Some code to write sheets:
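The code itself isn't reproduced in the answer, so here is only a minimal sketch of the pattern it describes; the class name, fetchReportRows() placeholder and three-column record are illustrative assumptions, with a plain loop standing in for Spring's RowCallbackHandler:

```java
// A minimal sketch of the streaming-write pattern described above, NOT the original answer's code.
import java.io.FileOutputStream;
import java.util.Collections;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingXlsxWriter {

    private static final int ROWS_PER_SHEET = 750_000; // the "limit" mentioned above

    public static void main(String[] args) throws Exception {
        // Keep only 50 rows in memory; older rows are flushed to temporary files on disk.
        SXSSFWorkbook workbook = new SXSSFWorkbook(50);
        Sheet sheet = workbook.createSheet("data-1");
        int rowNum = 0;
        int sheetCount = 1;

        for (String[] record : fetchReportRows()) {
            if (rowNum >= ROWS_PER_SHEET) {
                // Reached the per-sheet limit: start a fresh sheet and keep streaming.
                sheet = workbook.createSheet("data-" + (++sheetCount));
                rowNum = 0;
            }
            Row row = sheet.createRow(rowNum++);
            for (int col = 0; col < record.length; col++) {
                Cell cell = row.createCell(col);
                cell.setCellValue(record[col]);
            }
        }

        try (FileOutputStream out = new FileOutputStream("report.xlsx")) {
            workbook.write(out);
        }
        workbook.dispose(); // delete the temporary files backing the flushed rows
    }

    // Placeholder for the real data source (e.g. a JdbcTemplate query with a RowCallbackHandler).
    private static Iterable<String[]> fetchReportRows() {
        return Collections.singletonList(new String[] {"a", "b", "c"});
    }
}
```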
After reworking my code to dump the data to XLSX, I ran into the problem that the resulting files require 64-bit Office to open. So I needed to split my workbook with its many sheets into separate single-sheet XLSX files to make them readable on an average machine. Again I used a small visibility window and streamed processing, and the whole application kept working well without any signs of OutOfMemory.
Some code to read and split sheets:
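The original code isn't included here either, so the following is a rough sketch of that read-and-split pass: it streams each sheet of the huge source workbook through POI's XSSF event API and hands the rows to a hypothetical SheetSplittingHandler (shown below) that writes one small file per sheet. All class and file names are illustrative.

```java
// Sketch of the read-and-split driver: stream each sheet via SAX, never load the whole workbook.
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class WorkbookSplitter {

    public static void split(String hugeXlsxPath) throws Exception {
        try (OPCPackage pkg = OPCPackage.open(hugeXlsxPath, PackageAccess.READ)) {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();

            // Iterate the sheets one by one; each is parsed as a stream of SAX events.
            XSSFReader.SheetIterator sheets = (XSSFReader.SheetIterator) reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheetStream = sheets.next()) {
                    String sheetName = sheets.getSheetName();

                    SAXParserFactory factory = SAXParserFactory.newInstance();
                    factory.setNamespaceAware(true);
                    XMLReader parser = factory.newSAXParser().getXMLReader();

                    SheetSplittingHandler splitter =
                            new SheetSplittingHandler(sheetName + ".xlsx");
                    parser.setContentHandler(
                            new XSSFSheetXMLHandler(styles, strings, splitter, false));
                    parser.parse(new InputSource(sheetStream));
                    splitter.save(); // write the single-sheet output file
                }
            }
        }
    }
}
```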
and the handler that receives the rows:
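This, too, is only a sketch rather than the answer's original code: a SheetContentsHandler implementation that copies each streamed row into a new SXSSFWorkbook with a 50-row window and saves it as a single-sheet file (the class name and save() hook are illustrative):

```java
// Sketch of the per-sheet handler assumed above.
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFComment;

public class SheetSplittingHandler implements SheetContentsHandler {

    private final String targetFileName;
    private final SXSSFWorkbook workbook = new SXSSFWorkbook(50); // small visibility window
    private final Sheet sheet = workbook.createSheet();
    private Row currentRow;

    public SheetSplittingHandler(String targetFileName) {
        this.targetFileName = targetFileName;
    }

    @Override
    public void startRow(int rowNum) {
        currentRow = sheet.createRow(rowNum);
    }

    @Override
    public void cell(String cellReference, String formattedValue, XSSFComment comment) {
        // Copy the formatted value into the matching column of the output row.
        int col = new CellReference(cellReference).getCol();
        currentRow.createCell(col).setCellValue(formattedValue);
    }

    @Override
    public void endRow(int rowNum) {
        // Nothing to do here: rows outside the 50-row window are flushed to disk automatically.
    }

    @Override
    public void headerFooter(String text, boolean isHeader, String tagName) {
        // Headers/footers are not needed for splitting.
    }

    // Called by the splitter once the SAX parse of the source sheet has finished.
    public void save() throws Exception {
        try (FileOutputStream out = new FileOutputStream(targetFileName)) {
            workbook.write(out);
        }
        workbook.dispose(); // remove the temporary files behind the streamed rows
    }
}
```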
So it reads and writes data. I guess in your case you should rework your code to the same pattern: keep only a small footprint of data in memory. So for reading I would suggest creating a custom SheetContentsHandler that pushes the data to some database, where it can easily be processed, aggregated, etc. - for example along the lines of the sketch below.
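For illustration only (not code from the answer), a rough sketch of that suggestion: a SheetContentsHandler that batches the streamed rows into JDBC inserts. The table name, column count and DataSource wiring are placeholder assumptions.

```java
// Sketch: stream rows straight from the .xlsx into a staging table in small JDBC batches.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import javax.sql.DataSource;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;

public class DatabaseLoadingHandler implements SheetContentsHandler {

    private static final int COLUMN_COUNT = 3;   // adjust to the real sheet layout
    private static final int BATCH_SIZE = 1_000;

    private final Connection connection;
    private final PreparedStatement insert;
    private final String[] currentRow = new String[COLUMN_COUNT];
    private int pendingRows;

    public DatabaseLoadingHandler(DataSource dataSource) throws SQLException {
        connection = dataSource.getConnection();
        connection.setAutoCommit(false);
        insert = connection.prepareStatement(
                "INSERT INTO staging_rows (c1, c2, c3) VALUES (?, ?, ?)");
    }

    @Override
    public void startRow(int rowNum) {
        Arrays.fill(currentRow, null); // start a fresh row buffer
    }

    @Override
    public void cell(String cellReference, String formattedValue, XSSFComment comment) {
        int col = new CellReference(cellReference).getCol();
        if (col < COLUMN_COUNT) {
            currentRow[col] = formattedValue;
        }
    }

    @Override
    public void endRow(int rowNum) {
        try {
            for (int i = 0; i < COLUMN_COUNT; i++) {
                insert.setString(i + 1, currentRow[i]);
            }
            insert.addBatch();
            if (++pendingRows % BATCH_SIZE == 0) {
                insert.executeBatch();   // flush regularly so memory stays flat
                connection.commit();
            }
        } catch (SQLException e) {
            throw new RuntimeException("Failed to buffer row " + rowNum, e);
        }
    }

    @Override
    public void headerFooter(String text, boolean isHeader, String tagName) {
        // Not relevant for loading data rows.
    }

    // Call after parsing completes to flush the last partial batch and release resources.
    public void finish() throws SQLException {
        insert.executeBatch();
        connection.commit();
        insert.close();
        connection.close();
    }
}
```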
I was facing the same kind of issue reading a bulky .xlsx file using Apache POI, and I came across
excel-streaming-reader-github
This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API, and it can help you read large files.
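For reference, usage looks roughly like the sketch below; the builder options are version-dependent, so treat the exact calls as assumptions to verify against the project's README.

```java
// Sketch of reading a large .xlsx via excel-streaming-reader's builder API.
import java.io.FileInputStream;
import java.io.InputStream;
import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        try (InputStream is = new FileInputStream("big-file.xlsx");
             Workbook workbook = StreamingReader.builder()
                     .rowCacheSize(100)   // rows kept in memory at a time
                     .bufferSize(4096)    // bytes buffered while reading the stream
                     .open(is)) {         // looks like a standard POI Workbook, backed by streaming
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {
                    for (Cell cell : row) {
                        System.out.println(cell.getStringCellValue());
                    }
                }
            }
        }
    }
}
```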