I have a web project where I must import text and images from a user-supplied document, and one of the possible formats is Microsoft Office 2007. There's also a need to generate documents in this format.
The server runs CentOS 5.2 and has PHP/Perl/Python installed. I can execute local binaries and shell scripts if I must. We use Apache 2.2 but will be switching over to Nginx once it goes live.
What are my options? Anyone had experience with this?
The python docx module can generate formatted Microsoft office docx files from pure Python. Out of the box, it does headers, paragraphs, tables, and bullets, but the makeelement() module can be extended to do arbitrary elements like images.
You can probably check the code for Sphider. They docs and pdfs, so I'm sure they can read them. Might also lead you in the right direction for other Office formats.
I have successfully used the OpenXML Format SDK in a project to modify an Excel spreadsheet via code. This would require .NET and I'm not sure about how well it would work under Mono.
The Office 2007 file formats are open and well documented. Roughly speaking, all of the new file formats ending in "x" are zip compressed XML documents. For example:
The other file formats are roughly similar. I don't know of any open source libraries for interacting with them as yet - but depending on your exact requirements, it doesn't look too difficult to read and write simple documents. Certainly it should be a lot easier than with the older formats.
If you need to read the older formats, OpenOffice has an API and can read and write Office 2003 and older documents with more or less success.