I have a docx file that contains a lot of blank lines between sections. I need to remove a blank line whenever more than one appears consecutively. I unzip the file using:
z = zipfile.ZipFile('File.docx','a')
z.extractall()
Inside the word directory there is a file, document.xml, that contains all the data, but I don't know how to tell where a new line is in the XML.
I know that extracting it is not the solution (I use it here only to show where the file is). I think I can use:
z.write('Document.xml')
Can anyone help me?
The code from tlewis finds a particular text in the docx and replaces it. In your case there is something else to do: detect the new lines and see whether two or more of them appear in a row. In Word, a new line is just a paragraph (a <w:p> tag) without any text inside.
I have added some comments that will show you how to use the zip.
import zipfile  # Read and write the docx (zip) archive
from lxml import etree  # Parse the XML into a tree and serialize it back

templateDocx = zipfile.ZipFile("C:/Template.docx")  # Path to the file you want to import
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")  # Path of the output file

# Open document.xml, the file that holds the content.
# Read it as bytes: lxml rejects str input that carries an XML encoding declaration.
with open(templateDocx.extract("word/document.xml", "C:/"), "rb") as tempXmlFile:
    tempXmlStr = tempXmlFile.read()
tempXmlXml = etree.fromstring(tempXmlStr)  # Convert the bytes into an XML tree

############
# Algorithm detailed at the bottom.
# Write the code here that selects all <w:p> tags and checks whether each contains a <w:t> tag.
############

# Convert the changed XML back into bytes, keeping the XML declaration
tempXmlStr = etree.tostring(tempXmlXml, pretty_print=True, xml_declaration=True,
                            encoding="UTF-8", standalone=True)
with open("C:/temp.xml", "wb") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)  # Write the changed file

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        # Copy every file except the changed one into the new archive
        newDocx.writestr(file.filename, templateDocx.read(file))
newDocx.write("C:/temp.xml", "word/document.xml")  # Add the changed document.xml

templateDocx.close()  # Close both the template and the new docx
newDocx.close()
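If you would rather not extract anything to disk, the same copy-everything-but-document.xml idea can be done in memory. This is only a sketch: the function name rewrite_document_xml and its transform parameter are names I made up for illustration, and the demo runs on a tiny stand-in archive rather than a real Word file. It opens the output with mode "w" so that re-running it does not append duplicate entries to an existing file.

```python
import os
import tempfile
import zipfile


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy a docx archive, passing word/document.xml through transform(bytes) -> bytes.

    Hypothetical helper: every other archive member is copied unchanged.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = transform(data)
            dst.writestr(item.filename, data)


# Demo on a minimal stand-in archive (not a valid Word document)
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.docx")
dst_path = os.path.join(workdir, "out.docx")
with zipfile.ZipFile(src_path, "w") as z:
    z.writestr("word/document.xml", "<doc>old</doc>")
    z.writestr("[Content_Types].xml", "<Types/>")

rewrite_document_xml(src_path, dst_path, lambda xml: xml.replace(b"old", b"new"))

with zipfile.ZipFile(dst_path) as z:
    result_names = z.namelist()
    result_xml = z.read("word/document.xml")
```

In real use, transform would parse the bytes, apply the paragraph clean-up, and return the serialized bytes.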
How to write the algorithm to remove the multiple new lines
I created a sample document; here is the corresponding content of document.xml:
<w:p w:rsidR="006C517B" w:rsidRDefault="00761A87">
    <w:bookmarkStart w:id="0" w:name="_GoBack" />
    <w:bookmarkEnd w:id="0" />
    <w:r>
        <w:t>First Line</w:t>
    </w:r>
</w:p>
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
    <w:proofErr w:type="spellStart" />
    <w:r>
        <w:t>Third</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r>
        <w:t xml:space="preserve"> Line</w:t>
    </w:r>
</w:p>
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
    <w:r>
        <w:t>Six Line</w:t>
    </w:r>
</w:p>
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
    <w:proofErr w:type="spellStart" />
    <w:r>
        <w:t>Ten</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r>
        <w:t xml:space="preserve"> Line</w:t>
    </w:r>
</w:p>
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
    <w:proofErr w:type="spellStart" />
    <w:r>
        <w:t>Eleven</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r>
        <w:t xml:space="preserve"> Line</w:t>
    </w:r>
</w:p>
As you can see, a new line is an empty <w:p>, like this one:
<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
To remove the multiple new lines, check whether there are several empty <w:p> elements in a row, and remove all but the first.
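A rough sketch of that check could look like the following. The function names are mine, not part of any library. It uses only element methods that lxml.etree and xml.etree.ElementTree share, and the demo parses with the standard library so it runs without installing anything. One assumption to be aware of: it treats a paragraph as empty when it contains no run (<w:r>) at all, which is a little stricter than "no <w:t>" so that paragraphs holding only an image or a page break are left alone. Empty paragraphs that carry a section break are not handled specially.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace bound to the "w" prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{%s}p" % W_NS
W_R = "{%s}r" % W_NS
W_T = "{%s}t" % W_NS


def is_empty_paragraph(elem):
    """True for a <w:p> with no <w:r> anywhere inside it (assumed definition of 'empty')."""
    return elem.tag == W_P and next(elem.iter(W_R), None) is None


def collapse_empty_paragraphs(parent):
    """Within each run of consecutive empty paragraphs, keep the first and remove the rest."""
    removed = 0
    previous_was_empty = False
    for child in list(parent):  # iterate over a copy because we remove while looping
        empty = is_empty_paragraph(child)
        if empty and previous_was_empty:
            parent.remove(child)
            removed += 1
        previous_was_empty = empty
    return removed


# Cut-down version of the sample document above
SAMPLE = """<w:body xmlns:w="{ns}">
<w:p><w:r><w:t>First Line</w:t></w:r></w:p>
<w:p/>
<w:p><w:r><w:t>Third Line</w:t></w:r></w:p>
<w:p/>
<w:p/>
<w:p><w:r><w:t>Six Line</w:t></w:r></w:p>
<w:p/>
<w:p/>
<w:p/>
<w:p><w:r><w:t>Ten Line</w:t></w:r></w:p>
</w:body>""".format(ns=W_NS)

body = ET.fromstring(SAMPLE)
removed = collapse_empty_paragraphs(body)
remaining = len(body.findall(W_P))
texts = [t.text for t in body.iter(W_T)]
```

On a real document you would call it on the <w:body> element of the parsed document.xml. Paragraphs inside table cells have a different parent, so they would need their own call.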
Hope that helps!
From here:
import zipfile

replaceText = {"XXXCLIENTNAMEXXX": "Joe Bob", "XXXMEETDATEXXX": "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

# Word writes document.xml as UTF-8, so state the encoding instead of relying on the platform default
with open(templateDocx.extract("word/document.xml", "C:/"), encoding="utf-8") as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+", encoding="utf-8") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()
Explanation:
Step 1) Prepare a Python dictionary with the text strings you want to replace as keys and the new text as values (e.g. {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}).
Step 2) Open the template docx file using the zipfile module.
Step 3) Open a new docx file with the append access mode.
Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.
Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.
Step 6) Write the xml text string to a new temporary xml file.
Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.
Step 8) Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.
Step 9) Close your template and new docx archives.
Step 10) Open your new docx document and enjoy your replaced text!
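One detail in Step 5 is worth a note: the new text is pasted straight into raw XML, so a value containing & or < would leave document.xml malformed. A minimal sketch of the replacement step with escaping added, where replace_placeholders is a hypothetical helper name and escape comes from the standard library:

```python
from xml.sax.saxutils import escape


def replace_placeholders(xml_text, replacements):
    """Replace each placeholder key with its XML-escaped value (hypothetical helper)."""
    for placeholder, value in replacements.items():
        xml_text = xml_text.replace(placeholder, escape(str(value)))
    return xml_text


sample = "<w:t>Dear XXXCLIENTNAMEXXX, see you on XXXMEETDATEXXX</w:t>"
result = replace_placeholders(
    sample,
    {"XXXCLIENTNAMEXXX": "Joe & Bob", "XXXMEETDATEXXX": "May 31, 2013"},
)
```

Here the ampersand in "Joe & Bob" ends up as &amp; in the XML, which Word displays as a plain & again.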