I have several .docx files, each containing 300+ press releases of 1-2 pages each, that need to be separated into individual text files. The only consistent way to tell where one article ends and the next begins is that there is always, and only, a page break between two articles.
However, I don't know how to find page breaks when converting the enclosing Word documents to text, and the page break information is lost in the conversion done by my current script.
I want to know how to preserve HARD page breaks when converting a .docx file to .txt. It doesn't matter to me what they look like in the text file, as long as they're uniquely identifiable when scanning the text file later.
Here is the script I am using to convert the docx files to txt:
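from docx import opendocx, getdocumenttext  # legacy docx module API

def docx2txt(file_path):
    """Convert a .docx file to a .txt file with the same base name."""
    document = opendocx(file_path)
    paratextlist = getdocumenttext(document)
    # Encode each paragraph to UTF-8 (this script targets Python 2).
    newparatextlist = [paratext.encode("utf-8") for paratext in paratextlist]
    # Strip the ".docx" suffix; separate paragraphs with a blank line.
    with open("%s.txt" % file_path[:-5], "w") as text_file:
        text_file.write('\n\n'.join(newparatextlist))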
A hard page break will appear as a <w:br> element with w:type="page" within a run element (<w:r>), something like <w:r><w:br w:type="page"/></w:r>.

So one approach would be to replace all those occurrences with a distinctive string of text, like maybe "{{foobar}}".
An implementation of that would be something like this:
I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.

Let me know how you go and I can tweak this if needed.
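
For reference, here is a minimal sketch of the marker-replacement idea described above. It is not the docx-module/lxml approach mentioned in this answer: it swaps in the Python 3 standard library (zipfile and xml.etree.ElementTree) and reads word/document.xml directly. The function names docx_to_text and split_articles are hypothetical, and "{{foobar}}" is just the placeholder marker suggested above. It assumes simple documents and only looks at hard page breaks, that is, <w:br> elements with w:type="page".

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PAGE_BREAK_MARKER = "{{foobar}}"  # placeholder marker; any unique string works


def docx_to_text(file_path, marker=PAGE_BREAK_MARKER):
    """Return the document text with each hard page break replaced by marker.

    file_path may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Assumption: no nested paragraphs (e.g. text boxes), which would be
    # visited twice by this simple traversal.
    for para in root.iter(W + "p"):
        pieces = []
        for node in para.iter():
            if node.tag == W + "t":
                pieces.append(node.text or "")
            elif node.tag == W + "br" and node.get(W + "type") == "page":
                # Hard page break: emit the marker instead of dropping it.
                pieces.append(marker)
        paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)


def split_articles(text, marker=PAGE_BREAK_MARKER):
    """Split converted text into articles, dropping empty chunks."""
    return [part.strip() for part in text.split(marker) if part.strip()]
```

With something like this, split_articles(docx_to_text("releases.docx")) would give one string per press release, which can then be written to individual text files. Soft breaks that Word inserts during layout are deliberately ignored, since only the explicit page breaks separate articles.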