A customer is asking me to build a module for his running webapp that can load docx files and extract data based on the Headings found in the document. I know docx is just a zip file and most of what I need can be found in word/document.xml, though I'm not looking forward to parsing lists/styles/images/tables and whatever else needs to be translated from OOXML to HTML.
Are there any PHP libraries for this format? I do need some flexibility though: a plain OOXML to HTML converter is not going to cut it, because I need to break the document up into parts.
If it's purely docx, you can try phpdocx... don't know if it reads or only writes. PHPWord doesn't yet read, only writes (though I'm working on it).
If you only need the properties information, then you'll find it all within the /docProps/core.xml file within the zip (and possibly in /docProps/app.xml, depending on exactly which properties you need), so you can bypass most of the files that hold text, style, images, etc. To verify the file names, [Content_Types].xml lists the core and app properties files under the content types application/vnd.openxmlformats-package.core-properties+xml and application/vnd.openxmlformats-officedocument.extended-properties+xml respectively.
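As a rough illustration of the properties-only route, here is a minimal sketch that uses PHP's bundled ZipArchive and DOM extensions. The function name readCoreProperties() is hypothetical, and the sketch assumes the part sits at the conventional docProps/core.xml path rather than looking it up in [Content_Types].xml.

```php
<?php
// Sketch only: readCoreProperties() is a hypothetical helper, not a library API.
// Assumes the core properties part lives at the usual path docProps/core.xml.
function readCoreProperties(string $docxPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath as a zip archive");
    }
    // getFromName() returns false when the entry does not exist.
    $xml = $zip->getFromName('docProps/core.xml');
    $zip->close();
    if ($xml === false) {
        return [];
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        return [];
    }

    $props = [];
    // Each child element of the root (title, creator, created, ...) is one property.
    foreach ($dom->documentElement->childNodes as $node) {
        if ($node instanceof DOMElement) {
            $props[$node->localName] = trim($node->textContent);
        }
    }
    return $props;
}
```

Calling readCoreProperties('report.docx') would return an array keyed by the local element names, for example 'title' and 'creator', for whichever properties the file actually contains.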
EDIT:
If you need headings, then you will need to parse the document, not just the properties. That will mean identifying the heading styles, and parsing the text for entities with those styles.
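A minimal sketch of that heading-based parse, again with ZipArchive and DOMXPath. The function name splitDocxByHeadings() and its $headingPattern parameter are hypothetical. It assumes the heading paragraphs carry style IDs such as Heading1 or Heading2, which is common in English-language Word files but not guaranteed (the IDs are defined in word/styles.xml and can differ per template or locale). It only looks at top-level paragraphs, so text inside tables is ignored.

```php
<?php
// Sketch only: splitDocxByHeadings() is a hypothetical helper.
// Assumption: headings use paragraph style IDs matching $headingPattern.
function splitDocxByHeadings(string $docxPath, string $headingPattern = '/^Heading[1-9]$/i'): array
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }

    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $sections = [];
    $current = ['heading' => null, 'style' => null, 'paragraphs' => []];

    // Walk the top-level paragraphs in document order (tables are skipped).
    foreach ($xpath->query('/w:document/w:body/w:p') as $p) {
        // Style ID of the paragraph, or '' when none is set.
        $style = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);

        // Concatenate the text runs of this paragraph.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }

        if (preg_match($headingPattern, $style)) {
            // A heading closes the previous section and opens a new one.
            if ($current['heading'] !== null || $current['paragraphs']) {
                $sections[] = $current;
            }
            $current = ['heading' => $text, 'style' => $style, 'paragraphs' => []];
        } elseif ($text !== '') {
            $current['paragraphs'][] = $text;
        }
    }
    if ($current['heading'] !== null || $current['paragraphs']) {
        $sections[] = $current;
    }
    return $sections;
}
```

Each returned element holds the heading text, its style ID, and the plain text of the paragraphs that follow it up to the next heading. Any text before the first heading ends up in a section whose heading is null.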
Codeplex has a number of libraries that can work with MS Office documents:
- http://www.codeplex.com/site/search?query=ooxml
With the exception of PHPExcel, I do not know how mature those projects are. If there is nothing to help you out there, you can still use DOM.
OpenTBS can read and modify DOCX (and other OpenXML files) documents in PHP using the technique of templates.
No temporary files needed, no command lines, all in PHP.
But if you only need to read a part of the DOCX file, then you can use the class TbsZip. It can read zip archives (like any OpenXML file, a DOCX is a zip archive containing mostly XML files).
In DOCX files, the header and footer sub-files are usually "/word/header1.xml" and "/word/footer1.xml".
They exist only if a header/footer is defined.
There may also be an optional pair of XML sub-files for odd-numbered pages (usually "/word/header2.xml" and "/word/footer2.xml"),
and an optional pair of sub-files for the first page (usually "/word/header3.xml" and "/word/footer3.xml").
http://www.tinybutstrong.com/opentbs.php
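To check which of the header and footer sub-files described above are actually present in a given file, something like the following sketch could be used. It swaps in PHP's bundled ZipArchive class instead of TbsZip so that it runs without extra downloads, and listHeaderFooterParts() is a hypothetical helper name. It relies on the usual headerN.xml / footerN.xml naming rather than on the relationships declared in the document.

```php
<?php
// Sketch only: listHeaderFooterParts() is a hypothetical helper.
// Uses ZipArchive (bundled with PHP) rather than TbsZip.
function listHeaderFooterParts(string $docxPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath as a zip archive");
    }

    $parts = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Entry names inside the zip have no leading slash.
        if (preg_match('#^word/(header|footer)\d+\.xml$#', $name)) {
            $parts[$name] = $zip->getFromIndex($i); // raw XML of the sub-file
        }
    }
    $zip->close();

    ksort($parts); // stable, alphabetical order of entry names
    return $parts;
}
```

The result maps each entry name to its raw XML, so an empty array means the document defines no header or footer.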
You could also use the Apache POI libraries https://poi.apache.org/
and connect to them through the PHP/Java Bridge http://php-java-bridge.sourceforge.net/pjb/
- install a Tomcat server
- place the Java bridge in the webapps folder and add the POI libraries
- then you can use these libraries to extract the heading styles.
The API is well documented and you have many options.
A native PHP library that does this would be better, but you can try this approach if it works for you or somebody else.