Python-based document metadata parser?

Posted 2019-04-11 08:52

Does anyone know a good parser for document metadata in Python for Unix-like systems? In Java, Apache Tika is great.

No COM ... please :)

Thanks

4 answers
女痞
#2 · 2019-04-11 08:56

You don't have to use Jython to use Tika. You can call Java from Python using JCC. You can find decent instructions for this here.

When installing JCC you'll have to use one of two provided patches for setuptools, so it can build shared objects. The c7 version worked for me on Ubuntu 10.04.

Another option would be to use the Python subprocess module to call Tika and capture its stdout.
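A rough sketch of the subprocess route, in case it helps. It assumes you have a tika-app jar downloaded locally (the `tika-app.jar` path is a placeholder, not something from this thread) and that `java` is on your PATH. The function names are made up for illustration. It uses tika-app's `--json` option, which prints the document metadata as a JSON object.

```python
import json
import subprocess

# Assumption: location of a locally downloaded tika-app jar; adjust to your setup.
TIKA_JAR = "tika-app.jar"


def build_tika_command(path, jar=TIKA_JAR):
    """Return the argv list that asks tika-app for metadata as JSON."""
    # --json tells tika-app to print the metadata as a JSON object on stdout.
    return ["java", "-jar", jar, "--json", path]


def parse_tika_output(stdout_text):
    """Turn the JSON text printed by tika-app into a plain dict."""
    return json.loads(stdout_text)


def extract_metadata(path, jar=TIKA_JAR):
    """Run Tika in a child process and return the metadata as a dict."""
    result = subprocess.run(
        build_tika_command(path, jar),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return parse_tika_output(result.stdout.decode("utf-8"))
```

Something like `extract_metadata("report.xls").get("Content-Type")` would then give you the detected MIME type, assuming Tika reports that key for the file. Starting a JVM per document is slow, so this is probably only reasonable for small batches.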

三岁会撩人
#3 · 2019-04-11 08:57

If you like Tika, you could always use Jython so you can reference Tika directly.

forever°为你锁心
#4 · 2019-04-11 09:05

Tika seems like a great option. It's the only tool I've found (apart from OpenOffice in server mode) which supports old-style XLS files. I've done some work on making it easier to integrate Tika into a Python project, which you can find in this blog post.

我命由我不由天
#5 · 2019-04-11 09:10

hachoir_metadata works great with Excel documents: http://bitbucket.org/haypo/hachoir/wiki/Home
