Python-based document metadata parser?

Posted 2019-04-11 08:52

Does anyone know of a good document metadata parser in Python for Unix-like systems? In Java, Apache Tika is great.

No COM ... please :)

Thanks

Tags: python parsing

4 answers

女痞

#2 · 2019-04-11 08:56

You don't have to use Jython to use Tika. You can call Java from Python using JCC. You can find decent instructions for this here.

When installing JCC you'll have to use one of two provided patches for setuptools, so it can build shared objects. The c7 version worked for me on Ubuntu 10.04.

Another option would be to use Python's subprocess module to run Tika and capture its stdout.
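A minimal sketch of the subprocess approach, assuming a `tika-app` jar is available locally and `java` is on the PATH. The jar path and the function names are hypothetical. The `--json` flag asks the Tika command-line app to print metadata as JSON; check it against your Tika version.

```python
import json
import subprocess

# Hypothetical location of the Tika application jar; adjust to your install.
TIKA_JAR = "tika-app.jar"


def build_tika_command(path, jar=TIKA_JAR):
    """Return the argv list that asks tika-app for metadata as JSON."""
    return ["java", "-jar", jar, "--json", path]


def parse_tika_json(stdout_text):
    """Decode the JSON text that Tika printed on stdout."""
    return json.loads(stdout_text)


def extract_metadata(path, jar=TIKA_JAR):
    """Run Tika in a child process and return the decoded metadata."""
    result = subprocess.run(
        build_tika_command(path, jar),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError if java exits non-zero
    )
    return parse_tika_json(result.stdout)
```

Calling `extract_metadata("report.xls")` starts a new JVM for every file, so this is simple but slow for large batches.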

三岁会撩人

#3 · 2019-04-11 08:57

If you like Tika, you could always use Jython so that you can reference Tika directly.

forever°为你锁心

#4 · 2019-04-11 09:05

Tika seems like a great option. It's the only tool I've found (apart from OpenOffice in server mode) which supports old-style XLS files. I've done some work on making it easier to integrate Tika into a Python project, which you can find in this blog post.

我命由我不由天

#5 · 2019-04-11 09:10

hachoir_metadata works great with Excel documents: http://bitbucket.org/haypo/hachoir/wiki/Home
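A rough sketch of what that might look like, assuming the current `hachoir` package for Python 3, which merged the older `hachoir_parser` and `hachoir_metadata` modules into one. The helper names are hypothetical, and the "- Key: value" line shape is an assumption about the plain-text export, so verify it against your own output.

```python
def read_metadata_lines(filename):
    """Return hachoir's plain-text metadata lines for a file, or [] on failure."""
    # Imported lazily so this module loads even where hachoir is not installed.
    from hachoir.parser import createParser
    from hachoir.metadata import extractMetadata

    parser = createParser(filename)
    if not parser:
        return []  # hachoir found no parser for this file type
    with parser:
        metadata = extractMetadata(parser)
    if not metadata:
        return []
    # exportPlaintext() can yield nothing when no fields were found.
    return metadata.exportPlaintext() or []


def lines_to_dict(lines):
    """Collect '- Key: value' lines into a dict of lists; skip header lines."""
    result = {}
    for line in lines:
        if line.startswith("- ") and ": " in line:
            key, value = line[2:].split(": ", 1)
            result.setdefault(key, []).append(value)
    return result
```

Values are kept in lists because a document can carry several entries under the same key.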
