Does anyone know a good parser for document metadata in Python for Unix-like systems? In Java, Apache Tika is great.
No com ... please :)
Thanks
You don't have to use Jython to use Tika. You can call Java from Python using JCC. You can find decent instructions for this here.
When installing JCC you'll have to use one of two provided patches for setuptools, so it can build shared objects. The c7 version worked for me on Ubuntu 10.04.
Another option would be to use the python subprocess module to call and capture the stdout of Tika.
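For the subprocess route, here is a minimal sketch. It assumes you have a Java runtime on your PATH and a copy of the tika-app jar; the jar path and the helper names (`build_tika_command`, `parse_metadata`, `extract_metadata`) are made up for illustration. It relies on tika-app's `--metadata` option, which prints one `key: value` pair per line.

```python
import subprocess

# Hypothetical location; point this at wherever your tika-app jar lives.
TIKA_JAR = "/path/to/tika-app.jar"


def build_tika_command(document_path, jar_path=TIKA_JAR):
    """Build the argument list that asks tika-app for metadata only."""
    return ["java", "-jar", jar_path, "--metadata", document_path]


def parse_metadata(output):
    """Turn Tika's 'key: value' lines into a dict.

    If a key appears more than once, the last value wins.
    Lines without a ': ' separator are skipped.
    """
    metadata = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def extract_metadata(document_path, jar_path=TIKA_JAR):
    """Run Tika on one document and return its metadata as a dict."""
    result = subprocess.run(
        build_tika_command(document_path, jar_path),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return parse_metadata(result.stdout)
```

Calling `extract_metadata("report.xls")` would then give you a plain dict. Starting a JVM per document is slow, so this fits occasional use better than bulk processing.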
If you like Tika, you could always use Jython so you can reference Tika directly.
Tika seems like a great option. It's the only tool I've found (apart from OpenOffice in server mode) which supports old-style XLS files. I've done some work on making it easier to integrate Tika into a Python project, which you can find in this blog post.
hachoir_metadata works great with Excel documents: http://bitbucket.org/haypo/hachoir/wiki/Home