I need to be able to identify that a given file is an ODF file based on the contents of the file, and not on the file's extension.
ODF files are really a collection of XML files in a zip container, which means that I cannot use the file's magic number as it will just indicate that it is a zip file.
So what I'm really asking is: are there any files that are required to be present in an ODF container? If so, the presence of such a file in a zip container indicates that it is likely to be an ODF file, and its absence indicates that it definitely is not an ODF file.
Why not check out the ODF Technical Specification? The mimetype file listed there would probably be an ideal way to check (just look for the vnd.oasis.opendocument string in the mimetype).
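The mimetype check suggested above could be sketched roughly as follows in Python, using only the standard zipfile module. The function name looks_like_odf and the constant ODF_MIME_PREFIX are made up for this illustration. The sketch also assumes the archive actually carries a mimetype entry; a package written without one would be reported as non-ODF.

```python
import zipfile

# Assumed prefix: ODF media types start with "application/vnd.oasis.opendocument."
ODF_MIME_PREFIX = b"application/vnd.oasis.opendocument."


def looks_like_odf(path):
    """Return True if `path` is a zip archive whose `mimetype` entry names an ODF type.

    Hypothetical helper for illustration only; it does not validate the document.
    """
    # Anything that is not a zip archive cannot be an ODF package.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            mimetype = archive.read("mimetype")
        except KeyError:
            # No `mimetype` entry in the archive.
            return False
    # Tolerate stray whitespace or a trailing newline around the value.
    return mimetype.strip().startswith(ODF_MIME_PREFIX)
```

Calling looks_like_odf("report.odt") should return True for a typical word processor document, and False for an ordinary zip archive or a file that is not a zip archive at all.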
As I understand it, there will always be one or more .xml files in the root of the archive, and these files will always contain the string <office:document very near the beginning.
All those I have seen seem to contain a file called "content.xml" in the root, which does contain this string.
There are not so many applications writing ODF documents, and in the past, there was basically just one. So it shouldn't be too difficult to install some ancient version of OpenOffice, save a few files, and check that this rule applies as it does on current ODF files.
I would test with something like this on a batch of known ODF files, to check whether it is reliable:
$ unzip -p "$FILE" content.xml | grep -q '<office:document' && echo yes || echo NO
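If a shell pipeline is not convenient, the same content.xml heuristic could be sketched in Python with the standard zipfile module. The function name content_xml_looks_like_odf and the probe_bytes parameter are invented for this example, and reading only the first few kilobytes rests on the assumption that the marker really does appear "very near the beginning".

```python
import zipfile


def content_xml_looks_like_odf(path, probe_bytes=4096):
    """Return True if the archive has a root `content.xml` whose first
    `probe_bytes` bytes contain the marker `<office:document`.

    Hypothetical helper; this is a heuristic, not a validation of the document.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            # Read only the head of the member instead of the whole file.
            with archive.open("content.xml") as member:
                head = member.read(probe_bytes)
        except KeyError:
            # The archive has no `content.xml` in its root.
            return False
    return b"<office:document" in head
```

Running this over a directory of known ODF files and printing the names for which it returns False would be one way to test how reliable the rule is.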
Read the Build ID of the loaded document; if it is missing, the document is not ODF.
oDoc = ThisComponent
If oDoc.BuildID = "" Then
    bIsNotODF = True
End If