I'm currently working with Excel files (*.xlsm) and Apache POI, and I have been racking my brain over a task. I receive Excel files that have PDFs embedded in them, and I want to extract the PDFs and rename them based on the row and column they are in. This may seem odd, since I know embedded objects are represented as images, can span more than one cell, and are technically not "in" a cell.
The following code snippet lets me extract the embedded PDFs, but the embedded parts are named OleObject[1, 2, 3, ...], which doesn't give me any clue about where they sit in the sheet.
int i = 0; // counter used to name the extracted files
inStream = new FileInputStream(file);
XSSFWorkbook workbook = new XSSFWorkbook(inStream);
for (PackagePart pPart : workbook.getAllEmbedds()) {
    String contentType = pPart.getContentType();
    if (contentType.equals("application/vnd.openxmlformats-officedocument.oleObject")) {
        // The embedded part is an OLE2 container; this reads its "CONTENTS" stream.
        POIFSFileSystem fs = new POIFSFileSystem(pPart.getInputStream());
        TikaInputStream stream = TikaInputStream.get(fs.createDocumentInputStream("CONTENTS"));
        byte[] bytes = IOUtil.toByteArray(stream);
        OutputStream outStream = new FileOutputStream(new File(ROOT_DIRECTORY.getAbsolutePath() + "\\PDF" + i + ".pdf"));
        IOUtil.copy(bytes, outStream);
        outStream.close();
        i++;
    }
}
I wanted to know whether org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet will let me see the XML of the Excel sheet, so that maybe with that I can get the info I need. Like this:
<oleObjects>
  <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
    <mc:Choice Requires="x14">
      <oleObject progId="Acrobat Document" dvAspect="DVASPECT_ICON" shapeId="1028" r:id="rId4">
        <objectPr defaultSize="0" r:id="rId5">
          <anchor moveWithCells="1">
            <from><xdr:col>8</xdr:col><xdr:colOff>0</xdr:colOff><xdr:row>11</xdr:row><xdr:rowOff>0</xdr:rowOff></from>
            <to><xdr:col>8</xdr:col><xdr:colOff>1143000</xdr:colOff><xdr:row>13</xdr:row><xdr:rowOff>171450</xdr:rowOff></to>
          </anchor>
        </objectPr>
      </oleObject>
    </mc:Choice>
    <mc:Fallback>
      <oleObject progId="Acrobat Document" dvAspect="DVASPECT_ICON" shapeId="1028" r:id="rId4"/>
    </mc:Fallback>
  </mc:AlternateContent>
</oleObjects>
<objectPr defaultSize="0" r:id="rId5"><anchor moveWithCells="1"><from><xdr:col>8</xdr:col><xdr:colOff>0</xdr:colOff><xdr:row>11</xdr:row><xdr:rowOff>0</xdr:rowOff></from><to><xdr:col>8</xdr:col><xdr:colOff>1143000</xdr:colOff><xdr:row>13</xdr:row><xdr:rowOff>171450</xdr:rowOff></to></anchor></objectPr>
I guess the anchor information would make this possible, but I'm just unable to find out how to get it.
I hope this makes clear what I'm trying to do.
Thanks in advance.
I've looked at the source code in the current poi-ooxml-schemas sources JARs, which you can find here: http://repo1.maven.org/maven2/org/apache/poi/ooxml-schemas/1.3/
org.openxmlformats.schemas.spreadsheetml.x2006.main.CTWorksheet extends org.apache.xmlbeans.XmlObject, which can give you the XML as a string through the inherited toString() method. Alternatively, you can go straight to the OLE objects in the worksheet by calling getOleObjects() on your CTWorksheet object, which returns a CTOleObjects.
CTOleObjects itself extends org.apache.xmlbeans.XmlObject, so again you can get its XML using toString() for parsing, or get a list of org.openxmlformats.schemas.spreadsheetml.x2006.main.CTOleObject instances to iterate over using CTOleObjects.getOleObjectList().
CTOleObject doesn't seem to have getter methods for the <from> and <to> child XML elements that would let you determine the rows and columns, so I think you would need to do some XML parsing or string searching to get this info, provided it is contained in the string XML representation.
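The XML parsing route could look roughly like the sketch below. It uses only the JDK's DOM parser and assumes you already have the worksheet XML as a string (for example from toString() on the CTWorksheet). The class name OleAnchorSketch, the method anchorsByRelId and the SAMPLE string are made up for illustration; SAMPLE is the fragment from the question wrapped in a root element that declares the r and xdr prefixes. Elements are matched by local name in any namespace, so the oleObject is found even though it is nested inside mc:AlternateContent, where getOleObjectList() may not see it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Apache POI.
class OleAnchorSketch {
    // Namespace of the r:id attribute in OOXML parts.
    static final String REL_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Sample worksheet XML modelled on the fragment in the question.
    static final String SAMPLE =
        "<worksheet xmlns='http://schemas.openxmlformats.org/spreadsheetml/2006/main'"
      + " xmlns:r='" + REL_NS + "'"
      + " xmlns:xdr='http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing'>"
      + "<oleObjects>"
      + "<mc:AlternateContent xmlns:mc='http://schemas.openxmlformats.org/markup-compatibility/2006'>"
      + "<mc:Choice Requires='x14'>"
      + "<oleObject progId='Acrobat Document' dvAspect='DVASPECT_ICON' shapeId='1028' r:id='rId4'>"
      + "<objectPr defaultSize='0' r:id='rId5'><anchor moveWithCells='1'>"
      + "<from><xdr:col>8</xdr:col><xdr:colOff>0</xdr:colOff>"
      + "<xdr:row>11</xdr:row><xdr:rowOff>0</xdr:rowOff></from>"
      + "<to><xdr:col>8</xdr:col><xdr:colOff>1143000</xdr:colOff>"
      + "<xdr:row>13</xdr:row><xdr:rowOff>171450</xdr:rowOff></to>"
      + "</anchor></objectPr></oleObject>"
      + "</mc:Choice>"
      + "<mc:Fallback>"
      + "<oleObject progId='Acrobat Document' dvAspect='DVASPECT_ICON' shapeId='1028' r:id='rId4'/>"
      + "</mc:Fallback>"
      + "</mc:AlternateContent>"
      + "</oleObjects></worksheet>";

    /** Maps each oleObject's r:id to {row, col} of its <from> anchor (zero-based). */
    static Map<String, int[]> anchorsByRelId(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            Map<String, int[]> result = new LinkedHashMap<>();
            // "*" matches any namespace, so wrapper elements do not matter.
            NodeList oles = doc.getElementsByTagNameNS("*", "oleObject");
            for (int i = 0; i < oles.getLength(); i++) {
                Element ole = (Element) oles.item(i);
                NodeList froms = ole.getElementsByTagNameNS("*", "from");
                if (froms.getLength() == 0) {
                    continue; // e.g. the mc:Fallback copy, which carries no anchor
                }
                Element from = (Element) froms.item(0);
                int row = Integer.parseInt(firstText(from, "row"));
                int col = Integer.parseInt(firstText(from, "col"));
                result.put(ole.getAttributeNS(REL_NS, "id"), new int[] {row, col});
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse worksheet XML", e);
        }
    }

    // Text of the first descendant with the given local name.
    private static String firstText(Element parent, String localName) {
        return parent.getElementsByTagNameNS("*", localName)
                     .item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> e : anchorsByRelId(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " -> row " + e.getValue()[0]
                + ", col " + e.getValue()[1] + " (zero-based)");
        }
    }
}
```

The row and column values are zero-based, so row 11 and column 8 in the sample correspond to cell I12. The r:id on the oleObject element (rId4 here) should be the relationship that points at the embedded part, so you would still need to resolve it through the sheet's relationships to match each anchor to the PackagePart you extract; I have not verified that step.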
Hope this helps.