Reading property sets from Office 2007+ documents

Posted 2019-04-13 08:47

Question:

I am trying to read property sets from Office 2007+ documents (docx, xlsx). I found a great how-to at http://poi.apache.org/hpsf/how-to.html, which has an example for Office 2003 and earlier formats (doc, xls, without the "x"):

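import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class ReadSummaryInformation {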
    public static void main(final String[] args) throws IOException {
        final String filename = "C://file.docx";
        POIFSReader r = new POIFSReader();
        r.registerListener(new MyPOIFSReaderListener(),
                           "\005SummaryInformation");
        r.read(new FileInputStream(filename));
    }

    static class MyPOIFSReaderListener implements POIFSReaderListener {
        public void processPOIFSReaderEvent(final POIFSReaderEvent event)
        {
            SummaryInformation si = null;
            try {
                si = (SummaryInformation)
                    PropertySetFactory.create(event.getStream());
            }
            catch (Exception ex){
                throw new RuntimeException
                    ("Property set stream \"" +
                     event.getPath() + event.getName() + "\": " + ex);
            }
            final String title = si.getTitle();
            if (title != null)
                System.out.println("Title: \"" + title + "\"");
            else
                System.out.println("Document has no title.");
        }
    }
}

I tried to open docx and xlsx files with this code (that is, I tried to read "\005SummaryInformation" from them), and guess what? I got this exception:

Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: 
The supplied data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a different part of 
POI to process this data (eg XSSF instead of HSSF)

The POI home page, http://poi.apache.org/, states loud and clear that:

Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. This includes XLSX, DOCX and PPTX. The project provides a low level API to support the Open Packaging Conventions using openxml4j.

Then I looked at POI's API and found that HPSF has PropertySet, the class that actually accesses the metadata I want, but XSSF has no equivalent. That is just one of the explanations I found for the exception.

My question is: can I read this marvelous "\005SummaryInformation" from Office 2007+ files with POI? I have a strong feeling that the authors left the old API structure hanging and started a new one when the Office 2007 format came out.

Thank you in advance!


Update: I tried the POIXMLProperties approach, but I got an exception:

try {
   OPCPackage pkg = OPCPackage.open(new FileInputStream(new File("D:\\file.docx")));
   POIXMLProperties props;
   props = new POIXMLProperties(pkg);
   System.out.println("The title is " + props.getCoreProperties().getTitle());
} catch (Exception e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
}

Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
       at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:154)
       at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
       at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
       at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
       at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:267)
       at ReadSummaryInformation.main(ReadSummaryInformation.java:38)
Caused by: java.lang.ClassNotFoundException: org.dom4j.DocumentException
       at java.net.URLClassLoader$1.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at java.net.URLClassLoader.findClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       ... 6 more

My classpath looks like this:

  .;C:\Program Files (x86)\Java\jre6\lib\ext\QTJava.zip;D:\kituri\Java\JDBC
   driver\mysql-connector-java-5.1.22\mysql-connector-java-5.1.22-bin.jar;%JAVA_HOME%
   \lib;%XMLBEANS_HOME%\lib\xbean.jar;D:\work\Workspace\document_archive01-2212
   \src\RunClass.java;D:\work\Workspace\document_archive01-2212\poi-3.9\ooxml-
   lib\dom4j-1.6.1.jar

And my path looks like this:

 C:\oraclexe\app\oracle\product\11.2.0\server\bin;;C:\Oracle11g\product\11.2.0\dbhome_1
 \bin;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%
 \System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\ATI Technologies\ATI.ACE
 \Core-Static;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM
 \Bluetooth Software\syswow64;C:\Program Files (x86)\QuickTime\QTSystem\;C:\Program 
 Files (x86)\Java\apache-maven-3.0.4\bin;C:\Program Files (x86)\Java\jdk1.7.0_07\bin;D:
 \ChromeDriver;%XMLBEANS_HOME%\bin

These jars are imported in the project and set on the build path:

  • poi-3.9-20121203.jar
  • xbean.jar
  • poi-ooxml-3.9-20121203.jar

I spent 4 days trying to find the problem (reimporting the libraries and resetting the path variable), but I got dizzy, and I don't really have time for a problem that doesn't seem to be clear at all. I even checked the integrity of the imported libraries (I made sure that the .class files are present in the jars).

Answer 1:

The properties in an OOXML file are similar, but not quite identical, to their OLE2 cousins. So you can't use the HPSF SummaryInformation code directly, but there is something similar.
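
To see why the HPSF code rejects a docx, it may help to look at the containers themselves. OLE2 files (doc, xls) begin with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1, while OOXML files (docx, xlsx) are ZIP packages and begin with "PK" followed by the bytes 03 04. Below is a minimal JDK-only sketch of that check. The class name OfficeFormatSniffer and its methods are made up for this illustration and are not part of POI. A ZIP signature only means the file might be OOXML, since every ZIP archive starts the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a POI class.
class OfficeFormatSniffer {
    // Signature at the start of every OLE2 compound file (doc, xls, ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature; OOXML files (docx, xlsx, pptx) are ZIP packages.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns "OLE2", "ZIP" (possibly OOXML) or "UNKNOWN" for the leading bytes of a file. */
    static String detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java OfficeFormatSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            // A single read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length
                    && (n = in.read(header, filled, header.length - filled)) != -1) {
                filled += n;
            }
        }
        System.out.println(args[0] + ": " + detect(Arrays.copyOf(header, filled)));
    }
}
```

A file reported as "OLE2" is a candidate for the HPSF code from the question; one reported as "ZIP" needs the OOXML side of POI.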

The class you'll want is POIXMLProperties, something like:

OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
POIXMLProperties props = new POIXMLProperties(pkg);
System.out.println("The title is " + props.getCoreProperties().getTitle());

From POIXMLProperties you can get access to all the built-in properties, and the custom ones too!
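
For intuition about where these properties live, here is a JDK-only sketch that reads the title without POI. It assumes the core properties are stored in the part that Office conventionally names docProps/core.xml, as a Dublin Core dc:title element. Strictly speaking the part is located through the package relationships, which this sketch skips. The class name CoreXmlTitleReader is invented for the example; for real work POIXMLProperties is the better choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not a POI class.
class CoreXmlTitleReader {
    // Assumption: the conventional part name used by Office for core properties.
    private static final String CORE_PART = "docProps/core.xml";
    // Dublin Core namespace, which holds the title element.
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title text, or null if the part or the element is absent. */
    static String readTitle(InputStream ooxml) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!CORE_PART.equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the ZIP stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    xml.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations as a precaution against XXE.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java CoreXmlTitleReader <file.docx|file.xlsx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            String title = readTitle(in);
            System.out.println(title != null
                    ? "Title: \"" + title + "\""
                    : "Document has no title.");
        }
    }
}
```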

(Note that to work with OOXML files, you need some additional jars on your classpath. The Apache POI Components page has all the details.)
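
The NoClassDefFoundError for org/dom4j/DocumentException in the question means that dom4j was not on the classpath the program actually ran with. When a program is launched from an IDE, that is usually the project's build path rather than the CLASSPATH environment variable. The following is a small sketch for checking this from inside the running JVM. The class name OoxmlClasspathCheck is hypothetical. The probed class names come from the stack trace in the question plus org.apache.xmlbeans.XmlObject from XMLBeans. The list is an assumption and is not complete; the Components page remains the authority on what each POI version needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper for illustration only; not a POI class.
class OoxmlClasspathCheck {
    /** Returns true if the class can be found by this class's loader, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, OoxmlClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The classpath this JVM was started with, which may differ from %CLASSPATH%.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        List<String> absent = missing(
                "org.apache.poi.openxml4j.opc.OPCPackage", // poi-ooxml
                "org.dom4j.DocumentException",             // dom4j
                "org.apache.xmlbeans.XmlObject");          // XMLBeans (xbean.jar)
        if (absent.isEmpty()) {
            System.out.println("All probed classes are loadable.");
        } else {
            System.out.println("Not loadable: " + absent);
        }
    }
}
```

Running this with the same launch configuration as the failing program shows which jars that configuration is really missing.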