Find and replace text in .docx file - Python

2019-02-03 10:28发布

问题:

I've been doing a lot of searching for a method to find and replace text in a docx file with little luck. I've tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. "The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.").

import zipfile

replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()

My question is what's wrong with this method? I'm pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?

.

Here is a walkthrough of my thought process for everyone else trying to learn this stuff:

Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}).

Step 2) Open the template docx file using the zipfile module.

Step 3) Open a new new docx file with the append access mode.

Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.

Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.

Step 6) Write the xml text string to a new temporary xml file.

Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.

Step 8) Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.

Step 9) Close your template and new docx archives.

Step 10) Open your new docx document and enjoy your replaced text!

--Edit-- Missing closing parentheses ')' on lines 7 and 11

回答1:

Sometimes, Word does strange things. You should try to remove the text and rewrite it in one stroke, eg without editing the text in the middle

Your document is saved in a xml file (usually in word/document.xml for docx, afer unzipping). Sometimes it is possible that your text won't be in one stroke: it is possible that somewhere in the document, they is XXXCLIENT and somewhere else they is NAMEXXX.

Something like this:

<w:t> XXXCLIENT </w:t> ... <w:t> NAMEXXX </w:t>

This happens quite often because of language support : word splits words when he thinks that one word is of one specific language, and may do so between words, that will split the words into multiple tags.

Only problem with your solution is that you have to write everything in one stroke, which isn't the most user-friendly.

I have created a JS Library that uses mustache like tags: {clientName} https://github.com/edi9999/docxgenjs

It works globally the same as your algorithm but won't crash if the content is not in one stroke (when you write {clientName} in Word, the text will usually be splitted: {, clientName, } in the document.



回答2:

You can try a workaround. Use Word's search/replace to get the text in one stroke.

For example search for "XXXCLIENTNAMEXXX" and replace it again with "XXXCLIENTNAMEXXX".