How can I detect images in a document say doc,xls,ppt or pdf ?
I came across with Apache Tika, I am trying its command line option.
http://tika.apache.org/1.2/gettingstarted.html
But not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much nicer!
The first thing to do is to have the Tika App extract out any embedded resources within your file. Use the --extract
option for this, and have the extraction occur in a special temp directory you app controls, eg
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse that looking for images (but be aware that some images have an application/
prefix on their canconical mimetype!). You might need to run a second --detect step on a few, I'm not sure, test how the parsers get on with the extraction.
Now, if there were images, they'll be in your test dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
Having used Tika in the past I can't see how Tika can help with images embedded within Office documents or PDFs I was wrong to answer No. You will have may still try to resolve to native APIs like Apache POI and Apache PDFBox. Tika does use both libraries to parse text and metadata but no embedded image support.
Using Tika makes these APIs automatically available (side effect of using Tika).
UPDATE:
Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.