From Excel Embedded Object to Base64 String in XML

Posted 2019-07-25 02:46

Question:

I have an Excel sheet that lets users click on specific cells and attach/embed files, typically in .pdf and .jpg format. I've read the Busy Developers' Guide on reading embedded files with Apache POI, but I don't think I'm reading the correct data: when I save a file locally, or encode and then decode it for testing, the file is reported as corrupt and will not open.

Here is some code:

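// attachmentFile, attachment, idType, attachmentTextType, attachmentsType
// and ncObjectFactory are created elsewhere (not shown).
for (PackagePart pPart : workbook.getAllEmbedds()) {
    // Read the raw bytes of the embedded part (oleObjectN.bin)
    InputStream inputStream = pPart.getInputStream();
    byte[] bytes = IOUtils.toByteArray(inputStream);
    inputStream.close();

    // Base64-encode the bytes for the XML payload
    byte[] encoded = Base64.encodeBase64(bytes);

    attachmentFile.setValue(encoded);

    JAXBElement<Base64Binary> item = ncObjectFactory.createBinaryBase64Object(attachmentFile);

    // Attach the encoded object and its metadata
    attachment.getBinaryObject().add(item);
    attachment.getBinaryFormatID().add(idType);
    attachment.getBinaryDescriptionText().add(attachmentTextType);
    attachmentsType.getAttachment().add(attachment);
}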

The code above Base64-encodes the bytes for my XML. However, when I decode the result in a test script, the files will not open: Adobe reports that the file is corrupt or was not saved correctly.

I get oleObject1.bin, oleObject2.bin, oleObject3.bin, and so on as I iterate through getAllEmbedds(). I believe these are binary versions of my embedded files, so how do I convert them back to their original format so that they can be opened locally or on another machine?

My overall goal is to place the embedded objects into an XML document as Base64 binary objects and send that XML to another system, which can then extract the files for review. My current issue is that the files retrieved from the XML won't open because they are corrupt, damaged, or in the wrong format.

Update: Looking deeper into the oleObject.bin files, I see that some sort of wrapper is added around the original file, so there are extra bytes before and after the original content. When I open the file in Adobe, it reports the file as corrupt because it can't find %PDF within the first 1024 bytes. So my question becomes: how do I remove the wrapper, or at least the bytes at the beginning of the file?
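The symptom described in the update can be checked programmatically before anything is encoded. The sketch below is only a diagnostic, not the fix: it tests whether a byte array starts with the OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1) and looks for %PDF within the first 1024 bytes. The class and method names (EmbeddedHeaderCheck, isOle2, pdfHeaderOffset) are made up for illustration and are not part of Apache POI.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmbeddedHeaderCheck {
    // First 8 bytes of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True if the data begins with the OLE2 compound-file signature. */
    public static boolean isOle2(byte[] data) {
        return data.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(data, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /** Offset of "%PDF" within the first 1024 bytes, or -1 if it is absent. */
    public static int pdfHeaderOffset(byte[] data) {
        int limit = Math.min(data.length, 1024) - PDF_MAGIC.length;
        for (int i = 0; i <= limit; i++) {
            boolean match = true;
            for (int j = 0; j < PDF_MAGIC.length; j++) {
                if (data[i + j] != PDF_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for a real, unwrapped PDF.
        byte[] rawPdf = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isOle2(rawPdf));          // false
        System.out.println(pdfHeaderOffset(rawPdf)); // 0
    }
}
```

If isOle2 returns true for the bytes read from a part, the data is still wrapped and should be unpacked before encoding. A non-zero offset from pdfHeaderOffset only says where the marker sits. Cutting the array at that offset would be a heuristic, because the wrapper can also add trailing bytes.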

Answer 1:

I was able to figure this out for the oleObject.bin files. The problem is that each *.bin file wraps the original file in an OLE header, so Adobe reported an error when I tried to open it. I had to either remove the added header or find a way to read the content without it. Here's what worked for me:

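// Open the .bin part as an OLE2 (POIFS) file system
POIFSFileSystem fs = new POIFSFileSystem(pPart.getInputStream());
// In my files, the original content was stored in the "CONTENTS" entry
TikaInputStream stream = TikaInputStream.get(fs.createDocumentInputStream("CONTENTS"));

byte[] bytes = IOUtils.toByteArray(stream);
stream.close();
String encoded = Base64.encodeBase64String(bytes);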
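To confirm on the receiving side that the extracted bytes survive the trip through the XML, a round-trip check along these lines may help. This is a minimal sketch under stated assumptions: it uses the JDK's java.util.Base64 rather than the commons-codec Base64 class used above, and the class and method names (Base64RoundTrip, startsWithPdfHeader) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    /** Encodes raw file bytes as a Base64 string, as done before writing the XML. */
    public static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    /** Decodes the Base64 text read back from the XML. */
    public static byte[] decode(String base64Text) {
        return Base64.getDecoder().decode(base64Text);
    }

    /** True if the bytes start with the "%PDF" signature. */
    public static boolean startsWithPdfHeader(byte[] data) {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) {
        // Hypothetical placeholder bytes standing in for the extracted PDF content.
        byte[] original = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = decode(encode(original));
        System.out.println(Arrays.equals(original, restored)); // true
        System.out.println(startsWithPdfHeader(restored));     // true
    }
}
```

If the decoded bytes match the originals but still do not start with %PDF, the Base64 step is fine and the wrapper was never removed.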