Adding a row to a large xlsx file (Out of Memo

2019-03-06 12:53发布

问题:

The situation is as follows; I have a simple program which uses the Apache Poi Library to add one row of data at the end of the an exisiting xlsx file. See below

File file = new File(input);
XSSFWorkbook workbook = new XSSFWorkbook(file);
XSSFSheet sheet = workbook.getSheetAt(0);
XSSFRow row = sheet.createRow(sheet.getLastRowNum() + 1);

After this I will iterate over the row and set the CellValues. But the problem is that on the second line of the code, as shown above, I get an out of memory error. Is there a way to add a row of data to the existing xlsx file without having to read the file fully?

回答1:

You can try XSSF and SAX (Event API).



回答2:

If getting the XSSFWorkbook fails because of out-of-memory error and the need is to read and write the workbook, then neither SXSSF nor SAX parser will help. The one is only for writing. The other is only for reading.

Both approaches in follow needs knowledge about the *.xlsx file format which is Office Open XML. In general a *.xlsx file is a ZIP archive containing XML files and other files in a special directory structure. So one can unzip the *.xlsx file using a ZIP software to have a look at the XML files. The file format was first standardized by Ecma. So for further recherches I prefer Ecma Markup Language Reference. For example Row.

The ReadAndWriteTest.xlsx used in both examples must have at least one worksheet and the first worksheet must have at least one row.

One approach could be using the DOM methods of XMLBeans. My favorite reference for this is grepcode.

Example:

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;

import org.apache.poi.xssf.model.SharedStringsTable;

import java.io.File;
import java.io.OutputStream;

import org.openxmlformats.schemas.spreadsheetml.x2006.main.WorksheetDocument;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTSheetData;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTCell;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.STCellType;

import  org.openxmlformats.schemas.officeDocument.x2006.relationships.STRelationshipId;

import org.apache.xmlbeans.XmlOptions;

import javax.xml.namespace.QName;

import java.util.Map;
import java.util.HashMap;

import java.util.regex.Pattern;

class DOMReadAndWriteTest {

 public static void main(String[] args) {
  try {

   File file = new File("ReadAndWriteTest.xlsx");
   //we only open the OPCPackage, we don't create a Workbook
   OPCPackage opcpackage = OPCPackage.open(file);

   //if there are strings in the SheetData, we need the SharedStringsTable
   PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
   SharedStringsTable sharedstringstable = new SharedStringsTable();
   sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

   //get the PackagePart of the first sheet
   PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);
   //get the worksheet from the first sheet's XML
   //if it even fails while parsing this, then this approach is not usable
   WorksheetDocument worksheetdocument = WorksheetDocument.Factory.parse(sheetpart.getInputStream());
   CTWorksheet worksheet = worksheetdocument.getWorksheet();

   CTSheetData sheetdata = worksheet.getSheetData();

   //put some data in 10 new rows"
   for (int i = 0; i < 10; i++) {
    int rowsCount = sheetdata.sizeOfRowArray();

    CTCell ctcell= sheetdata.addNewRow().addNewC();

    CTRst ctstr = CTRst.Factory.newInstance();
    ctstr.setT("new Row " + (rowsCount + 1));
    int sRef = sharedstringstable.addEntry(ctstr);
    ctcell.setT(STCellType.S);
    ctcell.setV(Integer.toString(sRef));

    ctcell=sheetdata.getRowArray(rowsCount).addNewC();
    ctcell.setV(""+rowsCount+"."+(i+1)+""+((i+2>9)?0:i+2));
   }

   //write the SharedStringsTable
   OutputStream out = sharedstringstablepart.getOutputStream();
   sharedstringstable.writeTo(out);
   out.close();

   //create XmlOptions for saving the worksheet
   XmlOptions xmlOptions = new XmlOptions();
   xmlOptions.setSaveOuter();
   xmlOptions.setUseDefaultNamespace();
   xmlOptions.setSaveAggressiveNamespaces();
   xmlOptions.setCharacterEncoding("UTF-8");
   xmlOptions.setSaveSyntheticDocumentElement(new QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
   Map<String, String> map = new HashMap<String, String>();
   map.put(STRelationshipId.type.getName().getNamespaceURI(), "r");
   xmlOptions.setSaveSuggestedPrefixes(map);

   //save the worksheet
   out = sheetpart.getOutputStream();
   worksheet.save(out, xmlOptions);
   out.close();

   opcpackage.close();

  } catch (Exception ex) {
     ex.printStackTrace();
  }
 }
}

This code writes 10 new Rows in sheet1 of ReadAndWriteTest.xlsx without opening the whole workbook. But it must at least opening and parsing the sheet1 and the SharedStringsTable. If even this fails, then this approach is not usable.

Another approach could be using StAX. This API can read and write XML event driven. And it uses streaming.

Example:

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;

import org.apache.poi.xssf.model.SharedStringsTable;

import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

import javax.xml.namespace.QName;

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class StaxReadAndWriteTest {

 public static void main(String[] args) {
  try {

   File file = new File("ReadAndWriteTest.xlsx");
   OPCPackage opcpackage = OPCPackage.open(file);

   //if there are strings in the sheet data, we need the SharedStringsTable
   //if it even fails while parsing this SharedStringsTable, then this approach is not usable
   //then we must stream this XML event driven also.
   PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
   SharedStringsTable sharedstringstable = new SharedStringsTable();
   sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

   PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);

   XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(sheetpart.getInputStream());
   XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sheetpart.getOutputStream());

   XMLEventFactory eventFactory = XMLEventFactory.newInstance();

   int rowsCount = 0;

   while(reader.hasNext()){ //loop over all XML in sheet1.xml
    XMLEvent event = (XMLEvent)reader.next();
    writer.add(event); //by default write each readed event

    if(event.isStartElement()){
     StartElement startElement = (StartElement)event;
     QName startElementName = startElement.getName();
     if(startElementName.getLocalPart().equalsIgnoreCase("row")) { //start element of row
      boolean rowStart = true;
      rowsCount++;
      do {
       event = (XMLEvent)reader.next(); //find this row's end
       writer.add(event); //by default write each readed event

       if(event.isEndElement()){
        EndElement endElement = (EndElement)event;
        QName endElementName = endElement.getName();
        if(endElementName.getLocalPart().equalsIgnoreCase("row")) { //end element of row
         rowStart = false;
         //we assume that there is nothing else (character data) between end element of row and next element 
         XMLEvent nextElement = (XMLEvent)reader.peek();
         QName nextElementName = null;
         if (nextElement.isStartElement()) nextElementName = ((StartElement)nextElement).getName();
         else if (nextElement.isEndElement()) nextElementName = ((EndElement)nextElement).getName();
         if(!nextElementName.getLocalPart().equalsIgnoreCase("row")) { //next is not start element of row
          //we have the last row, so we write new rows now 

          for (int i = 0; i < 10; i++) {

           StartElement newRowStart = eventFactory.createStartElement(new QName("row"), null, null);
           writer.add(newRowStart);

//start cell A
           Attribute attribute = eventFactory.createAttribute("t", "s");
           List attributeList = Arrays.asList(attribute);
           StartElement newCellStart = eventFactory.createStartElement(new QName("c"), attributeList.iterator(), null);
           writer.add(newCellStart);

           CTRst ctstr = CTRst.Factory.newInstance();
           ctstr.setT("new Row " + (rowsCount +1));
           int sRef = sharedstringstable.addEntry(ctstr);

           StartElement newCellValue = eventFactory.createStartElement(new QName("v"), null, null);
           writer.add(newCellValue);

           Characters value = eventFactory.createCharacters(Integer.toString(sRef));
           writer.add(value);         

           EndElement newCellValueEnd = eventFactory.createEndElement(new QName("v"), null);
           writer.add(newCellValueEnd);

           EndElement newCellEnd = eventFactory.createEndElement(new QName("c"), null);
           writer.add(newCellEnd);
//end cell A
//start cell B
           newCellStart = eventFactory.createStartElement(new QName("c"), null, null);
           writer.add(newCellStart);

           newCellValue = eventFactory.createStartElement(new QName("v"), null, null);
           writer.add(newCellValue);

           value = eventFactory.createCharacters(""+rowsCount+"."+(i+1)+""+((i+2>9)?0:i+2));
           writer.add(value);         

           newCellValueEnd = eventFactory.createEndElement(new QName("v"), null);
           writer.add(newCellValueEnd);

           newCellEnd = eventFactory.createEndElement(new QName("c"), null);
           writer.add(newCellEnd);
//end cell B

           EndElement newRowEnd = eventFactory.createEndElement(new QName("row"), null);
           writer.add(newRowEnd);

           rowsCount++;
          }
         }
        }
       }
      } while (rowStart);
     }
    }
   }

   writer.flush();

   //write the SharedStringsTable
   OutputStream out = sharedstringstablepart.getOutputStream();
   sharedstringstable.writeTo(out);
   out.close();

   opcpackage.close();

  } catch (Exception ex) {
     ex.printStackTrace();
  }
 }
}

This code also writes 10 new Rows in sheet1 of ReadAndWriteTest.xlsx without opening the whole workbook. But it must at least opening and parsing the SharedStringsTable. If even this fails, then this approach is also not usable. But of course even the SharedStringsTable could be streamed using StAX. But as you see in example with generating the rows and cells, this is much more complicated. So using the SharedStringsTable makes things easier in this example.



回答3:

(not enough reputation to add this as a comment) Have you tried using SXSSFWorkbook instead of XSSFWorkbook?