Apache POI much quicker using HSSF than XSSF - wha

2019-04-20 18:48发布

I've been having some issues with parsing .xlsx files with Apache POI - I am getting java.lang.OutOfMemoryError: Java heap space in my deployed app. I'm only processing files under 5MB and around 70,000 rows so my suspicion from reading number other questions is that something is amiss.

As suggested in this comment I decided to run SSPerformanceTest.java with the suggested variables so see if there is anything wrong with my code or setup. The results show a significant difference between HSSF (.xls) and XSSF (.xlsx):

1) HSSF 50000 50 1: Elapsed 1 seconds

2) SXSSF 50000 50 1: Elapsed 5 seconds

3) XSSF 50000 50 1: Elapsed 15 seconds

The FAQ specifically says:

If you can't run that with 50,000 rows and 50 columns in all of HSSF, XSSF and SXSSF in under 3 seconds (ideally a lot less!), the problem is with your environment.

Next, it says to run XLS2CSV.java which I have done. Feeding in the XSSF file generated above (with 50000 rows and 50 columns) takes around 15 seconds - the same amount it took to write the file.

Is something wrong with my environment, and if so how do I investigate further?

Stats from VisualVM show the heap used shooting up to 1.2Gb during the processing. Surely this is way too high considering that's an extra gig on top of the heap compared to before processing began?

Heap space is surely too high here?

Note: The heap space exception mentioned above only happens in production (on Google App Engine) and only for .xlsx files, however the tests mentioned in this question have all been run on my development machine with -Xmx2g. I'm hoping that if I can fix the problem on my development setup it will use less memory when I deploy.

Stack trace from app engine:

Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260) at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997) at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3211) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082) at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802) at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)

2条回答
该账号已被封号
2楼-- · 2019-04-20 19:00

The average XLSX sheet I work is about 18-22 sheets of 750 000 rows with 13-20 columns. This is spinning in the Spring web application with lots of other functionalities. I gave to whole application not that much of memory: -Xms1024m -Xmx4096m - and it works great!

First of all dumping code: it is wrong to load each and every data row in memory and than starting to dump it. In my case (reporting from the PostgreSQL database) I reworked data dump procedure to use RowCallbackHandler to write to my XLSX, during this once I reach "my limit" of 750000 rows, I create new sheet. And workbook is created with visibility window of 50 rows. In this way I am able to dump huge volumes: size of XLSX file is about 1230Mb.

Some code to write sheets:

    jdbcTemplate.query(
        new PreparedStatementCreator() {
            @Override
            public PreparedStatement createPreparedStatement(Connection connection) throws SQLException {
                PreparedStatement statement = connection.prepareStatement(finalQuery, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                statement.setFetchSize(100);
                statement.setFetchDirection(ResultSet.FETCH_FORWARD);
                return statement;
            }
        }, new RowCallbackHandler() {
            Sheet sheet = null;
            int i = 750000;
            int tableId = 0;

            @Override
            public void processRow(ResultSet resultSet) throws SQLException {
                if (i == 750000) {
                    tableId++;
                    i = 0;
                    sheet = wb.createSheet(sheetName.concat(String.format("%02d%n", tableId)));


                    Row r = sheet.createRow(0);

                    Cell c = r.createCell(0);
                    c.setCellValue("id");
                    c = r.createCell(1);
                    c.setCellValue("Дата");
                    c = r.createCell(2);
                    c.setCellValue("Комментарий");
                    c = r.createCell(3);
                    c.setCellValue("Сумма операции");
                    c = r.createCell(4);
                    c.setCellValue("Дебет");
                    c = r.createCell(5);
                    c.setCellValue("Страхователь");
                    c = r.createCell(6);
                    c.setCellValue("Серия договора");
                    c = r.createCell(7);
                    c.setCellValue("Номер договора");
                    c = r.createCell(8);
                    c.setCellValue("Основной агент");
                    c = r.createCell(9);
                    c.setCellValue("Кредит");
                    c = r.createCell(10);
                    c.setCellValue("Программа");
                    c = r.createCell(11);
                    c.setCellValue("Дата начала покрытия");
                    c = r.createCell(12);
                    c.setCellValue("Дата планового окончания покрытия");
                    c = r.createCell(13);
                    c.setCellValue("Периодичность уплаты взносов");
                }
                i++;

                PremiumEntity e = PremiumEntity.builder()
                    .Id(resultSet.getString("id"))
                    .OperationDate(resultSet.getDate("operation_date"))
                    .Comments(resultSet.getString("comments"))
                    .SumOperation(resultSet.getBigDecimal("sum_operation").doubleValue())
                    .DebetAccount(resultSet.getString("debet_account"))
                    .Strahovatelname(resultSet.getString("strahovatelname"))
                    .Seria(resultSet.getString("seria"))
                    .NomPolica(resultSet.getLong("nom_polica"))
                    .Agentname(resultSet.getString("agentname"))
                    .CreditAccount(resultSet.getString("credit_account"))
                    .Program(resultSet.getString("program"))
                    .PoliciStartDate(resultSet.getDate("polici_start_date"))
                    .PoliciPlanEndDate(resultSet.getDate("polici_plan_end_date"))
                    .Periodichn(resultSet.getString("id_periodichn"))
                    .build();

                Row r = sheet.createRow(i);
                Cell c = r.createCell(0);
                c.setCellValue(e.getId());

                if (e.getOperationDate() != null) {
                    c = r.createCell(1);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getOperationDate());
                }

                c = r.createCell(2);
                c.setCellValue(e.getComments());

                c = r.createCell(3);
                c.setCellValue(e.getSumOperation());

                c = r.createCell(4);
                c.setCellValue(e.getDebetAccount());

                c = r.createCell(5);
                c.setCellValue(e.getStrahovatelname());

                c = r.createCell(6);
                c.setCellValue(e.getSeria());

                c = r.createCell(7);
                c.setCellValue(e.getNomPolica());

                c = r.createCell(8);
                c.setCellValue(e.getAgentname());

                c = r.createCell(9);
                c.setCellValue(e.getCreditAccount());

                c = r.createCell(10);
                c.setCellValue(e.getProgram());

                if (e.getPoliciStartDate() != null) {
                    c = r.createCell(11);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getPoliciStartDate());
                }
                ;

                if (e.getPoliciPlanEndDate() != null) {
                    c = r.createCell(12);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getPoliciPlanEndDate());
                }

                c = r.createCell(13);
                c.setCellValue(e.getPeriodichn());
            }
        });

After reworking my code on dumping the data to XLSX, I came to problem, that it requires Office in 64 bits to open them. So I need to split my workbook with lots of sheets into separate XLSX files with single sheets to make them readable on average machine. And again I used small visibility windows and streamed processing, and kept the whole application working well without any sights of OutOfMemory.

Some code to read and split sheets:

        OPCPackage opcPackage = OPCPackage.open(originalFile, PackageAccess.READ);


        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(opcPackage);
        XSSFReader xssfReader = new XSSFReader(opcPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();

            DataFormatter formatter = new DataFormatter();
            InputSource sheetSource = new InputSource(stream);

            SheetToWorkbookSaver saver = new SheetToWorkbookSaver(sheetName);
            try {
                XMLReader sheetParser = SAXHelper.newXMLReader();
                ContentHandler handler = new XSSFSheetXMLHandler(
                    styles, null, strings, saver, formatter, false);
                sheetParser.setContentHandler(handler);
                sheetParser.parse(sheetSource);
            } catch(ParserConfigurationException e) {
                throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
            }

            stream.close();

            // this creates new File descriptors inside storage
            FileDto partFile = new FileDto("report_".concat(StringUtils.trimToEmpty(sheetName)).concat(".xlsx"));
            File cloneFile = fileStorage.read(partFile);
            FileOutputStream cloneFos = new FileOutputStream(cloneFile);
            saver.getWb().write(cloneFos);
            cloneFos.close();
        }

and

public class SheetToWorkbookSaver implements XSSFSheetXMLHandler.SheetContentsHandler {

    private SXSSFWorkbook wb;
    private Sheet sheet;
    private CellStyle dateStyle ;


    private Row currentRow;

    public SheetToWorkbookSaver(String workbookName) {
        this.wb = new SXSSFWorkbook(50);
        this.dateStyle = this.wb.createCellStyle();
        this.dateStyle.setDataFormat(this.wb.getCreationHelper().createDataFormat().getFormat("dd.mm.yyyy"));

        this.sheet = this.wb.createSheet(workbookName);

    }

    @Override
    public void startRow(int rowNum) {
        this.currentRow = this.sheet.createRow(rowNum);
    }

    @Override
    public void endRow(int rowNum) {

    }

    @Override
    public void cell(String cellReference, String formattedValue, XSSFComment comment) {
        int thisCol = (new CellReference(cellReference)).getCol();
        Cell c = this.currentRow.createCell(thisCol);
        c.setCellValue(formattedValue);
        c.setCellComment(comment);
    }

    @Override
    public void headerFooter(String text, boolean isHeader, String tagName) {

    }


    public SXSSFWorkbook getWb() {
        return wb;
    }
}

So it reads and writes data. I guess in your case you should rework your code to same patterns: keep in memory only small footprint of data. So I would suggest for reading create custom SheetContentsReader, which will be pushing data to some database, where it can be easily processed, aggregated, etc.

查看更多
做个烂人
3楼-- · 2019-04-20 19:18

I was facing same kind of issue to read bulky .xlsx file using Apache POI and I came across

excel-streaming-reader-github

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API

This library can help you to read large files.

查看更多
登录 后发表回答