I need to iterate through an Excel spreadsheet row by row, but not all at once. I want to keep a stream open to the spreadsheet, and then pass that object along to another method. This method will ask for the next row of data from time to time. This method is already established, so I can't change it.
I tried doing this initially with XSSFWorkbook, which worked really well until I ran out of memory on real data.
Then I tried switching to the SAX parser, using XSSFSheetXMLHandler with a custom handler similar to the XLSX2CSV sample provided by the Apache POI project. But this processes all rows in a single pass, and if I store them in memory to read later I also run out of memory. I also no longer had access to the DataFormatter, which I need for cell value processing.
Can someone point me to an example/class that will allow me to do this?
My preferred streaming API for XML is StAX.
Knowing that a *.xlsx file is simply a ZIP archive, and that Apache POI's OPCPackage for it is a ZipPackage, we could think about the following approach:
1. Get the /xl/worksheets/sheetN.xml package part from the *.xlsx Excel ZipPackage.
2. Create a StAX reader on it.
3. Read from the XML using this reader.
The following example creates a rudimentary application which does this row by row on a button click.
For sure this is only a draft to show the principle. The whole application would be much more code.
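The three steps can also be sketched without a GUI. The block below is a minimal sketch under stated assumptions, not the button-click application itself: it swaps POI's OPCPackage for the JDK's java.util.zip.ZipFile so that it has no dependencies, the class name SheetRowReader and its nextRow() method are hypothetical, and it returns only the raw text of each cell's v element (shared string indexes and unformatted numbers), skipping cells that have no v element.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: pulls one row at a time from a sheet part of an .xlsx file. */
class SheetRowReader implements AutoCloseable {
    private final ZipFile zip;
    private final InputStream in;
    private final XMLStreamReader xml;

    /**
     * @param xlsxPath  path to the .xlsx file
     * @param sheetPart ZIP entry name without a leading slash,
     *                  e.g. "xl/worksheets/sheet1.xml"
     */
    SheetRowReader(String xlsxPath, String sheetPart)
            throws IOException, XMLStreamException {
        zip = new ZipFile(xlsxPath);
        ZipEntry entry = zip.getEntry(sheetPart);
        if (entry == null) {
            zip.close();
            throw new IOException("No such part: " + sheetPart);
        }
        in = zip.getInputStream(entry);
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        xml = factory.createXMLStreamReader(in);
    }

    /** Returns the raw v texts of the next row, or null when no rows are left. */
    List<String> nextRow() throws XMLStreamException {
        List<String> cells = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if ("row".equals(name)) {
                    cells = new ArrayList<>();
                } else if ("v".equals(name) && cells != null) {
                    // getElementText() leaves the reader on the closing v tag.
                    cells.add(xml.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "row".equals(xml.getLocalName())) {
                return cells; // stop here; the stream stays open for the next call
            }
        }
        return null;
    }

    @Override
    public void close() throws IOException, XMLStreamException {
        xml.close();
        in.close();
        zip.close();
    }
}
```

A caller could create one reader, hand it to the consuming method, and let that method call nextRow() whenever it wants another row. Only the current row is held in memory between calls.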
Next we would have to read and parse the /xl/sharedStrings.xml package part, which contains the shared strings. We also have to read and parse the /xl/styles.xml package part, which contains the cell styles. We need the styles to detect whether a numeric value is a date or a number and, if it is a number, what kind of number. This is necessary because Excel stores all kinds of numbers as double values. Dates are also stored as doubles, meaning days counted from 01/01/1900, with the fractional part giving the time of day: 1h = 1/24, 1m = 1/24/60, 1s = 1/24/60/60.
But this is possible using the same approach as with the /xl/worksheets/sheetN.xml package part.
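As a rough illustration of those two follow-up tasks, here is a sketch using only the JDK. The class and method names (XlsxParts, readSharedStrings, toDateTime) are hypothetical. The date conversion assumes the default 1900 date system and serial values of 61 or more (dates from 01/03/1900 on), where counting days from 1899-12-30 accounts for Excel treating 1900 as a leap year. Deciding whether a cell is a date at all still requires the number format from /xl/styles.xml, which is not shown here.

```java
import java.io.InputStream;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helpers for the shared strings part and for date serial numbers. */
class XlsxParts {

    /** Reads every si item of a sharedStrings.xml stream into a list, in index order. */
    static List<String> readSharedStrings(InputStream in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> strings = new ArrayList<>();
        StringBuilder current = null;
        boolean inPhonetic = false; // rPh runs hold phonetic hints, not cell text
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if ("si".equals(name)) {
                    current = new StringBuilder();
                } else if ("rPh".equals(name)) {
                    inPhonetic = true;
                } else if ("t".equals(name) && current != null) {
                    // Rich text items have several t runs; join them.
                    String text = xml.getElementText();
                    if (!inPhonetic) {
                        current.append(text);
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = xml.getLocalName();
                if ("rPh".equals(name)) {
                    inPhonetic = false;
                } else if ("si".equals(name) && current != null) {
                    strings.add(current.toString());
                    current = null;
                }
            }
        }
        xml.close();
        return strings;
    }

    /** Converts a serial number (assumed: 1900 date system, serial >= 61) to a date-time. */
    static LocalDateTime toDateTime(double serial) {
        long days = (long) Math.floor(serial);
        // The fractional part is a fraction of a day: 1s = 1/24/60/60.
        long seconds = Math.round((serial - days) * 86_400);
        return LocalDateTime.of(1899, 12, 30, 0, 0).plusDays(days).plusSeconds(seconds);
    }
}
```

A cell with t="s" then resolves through the list: its v value is the index into the result of readSharedStrings.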