extracting text from MS word files in python

2019-01-01 05:52发布

for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux? Is there any library?

14条回答
素衣白纱
2楼-- · 2019-01-01 05:52

If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.

content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        content = docx.read(item.orig_filename)

    else:
        pass

Your content string however needs to be cleaned up, one way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
        else:
            pass
    else:
        pass

# Assemble a new string with all pure content
content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.

查看更多
谁念西风独自凉
3楼-- · 2019-01-01 05:53

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.

查看更多
不再属于我。
4楼-- · 2019-01-01 05:56

Use the native Python docx module. Here's how to extract all the text from a doc:

document = docx.Document(filename)
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs
])
print docText

See Python DocX site

Also check out Textract which pulls out tables etc.

Parsing XML with regexs invokes cthulu. Don't do it!

查看更多
几人难应
5楼-- · 2019-01-01 05:58

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]>//g'*: Remove everything inside tags

grep -v '^[[:space:]]$'*: Remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)

查看更多
一个人的天荒地老
6楼-- · 2019-01-01 06:01

You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.

查看更多
零度萤火
7楼-- · 2019-01-01 06:01

Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

查看更多
登录 后发表回答