Extracting text from MS Word files in Python

Posted 2019-01-01 05:52

For working with MS Word files in Python, there are the Python win32 extensions, but they only work on Windows. How do I do the same on Linux? Is there a library for it?

14 answers
长期被迫恋爱
#2 · 2019-01-01 06:16

benjamin's answer is a pretty good one. I have just consolidated...

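import re
import zipfile

# A .docx file is a zip archive; the document body lives in word/document.xml.
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Strip every XML tag, leaving only the text between them.
cleaned = re.sub(r'<[^>]+>', '', content)
print(cleaned)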
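
One limitation of stripping tags with a regex is that paragraph boundaries vanish, so all the text runs together. Below is a minimal sketch of an alternative that parses the XML and joins the text one paragraph per line. The function names `docx_to_text` and `make_sample_docx` are made up for illustration, and the sample archive holds only `word/document.xml`, so it is enough for this demo but is not a complete .docx that Word would open. Paragraphs nested inside other paragraphs (text boxes, for example) may show up twice with this approach.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_to_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as docx:
        root = ET.fromstring(docx.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # Each w:t element holds one run of text inside the paragraph.
        paragraphs.append(''.join(t.text or '' for t in para.iter(W_NS + 't')))
    return '\n'.join(paragraphs)


def make_sample_docx():
    """Build a tiny two-paragraph archive in memory (demonstration only)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Hello</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


if __name__ == '__main__':
    print(docx_to_text(make_sample_docx()))
```

For a real file, pass its path instead of the in-memory sample, e.g. `docx_to_text('/path/to/file/mydocument.docx')`.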
爱死公子算了
#3 · 2019-01-01 06:17

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
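
As a rough illustration of the idea, here is a sketch that does not use the OOo scripting API itself but simply shells out to the office suite in headless mode. It assumes a LibreOffice (the OOo successor) install that puts a `soffice` binary on the PATH and accepts `--headless --convert-to txt:Text --outdir`; older OpenOffice.org builds may not support these flags. The helper names `build_convert_command` and `word_to_text` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice='soffice'):
    """Return the argument list for a headless conversion to plain text."""
    return [
        soffice, '--headless',
        '--convert-to', 'txt:Text',
        '--outdir', str(out_dir),
        str(doc_path),
    ]


def word_to_text(doc_path, out_dir):
    """Convert doc_path to text; return None if soffice is not installed."""
    if shutil.which('soffice') is None:
        return None
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # The converter writes <original stem>.txt into out_dir.
    out_file = Path(out_dir) / (Path(doc_path).stem + '.txt')
    # The output encoding can depend on the locale, so decode leniently.
    return out_file.read_text(encoding='utf-8', errors='replace')
```

Because the office suite does the parsing, this route should also cope with the older binary .doc format, which the zip-based approach cannot read.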
