How to identify page breaks using python-docx from

Posted 2019-02-14 15:11

Question:

I have several .docx files, each containing 300+ press releases of 1-2 pages apiece, that need to be separated into individual text files. The only consistent way to tell one article from the next is that there is always a page break, and nothing else, between two articles.

However, I don't know how to find page breaks when converting the enclosing Word documents to text, and my current script loses the page break information during the conversion.

I want to know how to preserve HARD page breaks when converting a .docx file to .txt. It doesn't matter to me what they look like in the text file, as long as they're uniquely identifiable when scanning the text file later.

Here is the script I am using to convert the docx files to txt:

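import io
import os

# opendocx and getdocumenttext come from the legacy docx module (python-docx 0.2.x)
from docx import opendocx, getdocumenttext


def docx2txt(file_path):
    document = opendocx(file_path)
    # getdocumenttext returns a list of unicode strings, one per paragraph
    paratextlist = getdocumenttext(document)
    # Write next to the source file, swapping the extension for .txt
    txt_path = os.path.splitext(file_path)[0] + ".txt"
    # io.open encodes to UTF-8 on write and closes the file when done
    with io.open(txt_path, "w", encoding="utf-8") as text_file:
        text_file.write(u"\n\n".join(paratextlist))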

Answer 1:

A hard page break will appear as a <w:br> element within a run element (<w:r>), something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>
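To make that structure concrete, here is a minimal sketch that counts such elements in an XML fragment using only the standard library. The helper name count_page_breaks and the SAMPLE string are made up for illustration; the namespace URI is the standard WordprocessingML one that the w: prefix maps to.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "w:" prefix in WordprocessingML documents
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample mirroring the fragment above, with the namespace declared
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>some text</w:t><w:br w:type="page"/></w:r>'
    '</w:p>' % W_NS
)


def count_page_breaks(xml_text):
    """Count <w:br> elements whose w:type attribute is "page"."""
    root = ET.fromstring(xml_text)
    br_tag = "{%s}br" % W_NS        # ElementTree uses {namespace}tag names
    type_attr = "{%s}type" % W_NS   # attributes are namespaced the same way
    return sum(1 for el in root.iter(br_tag) if el.get(type_attr) == "page")
```

A plain line break is a <w:br/> without w:type="page", so the attribute check is what separates hard page breaks from ordinary line breaks.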

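The conversion loses these because getdocumenttext() only collects the text of <w:t> elements. So one approach would be to replace each page-break element with a <w:t> element containing a distinctive string of text, such as "{{foobar}}", before extracting the text.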

An implementation of that would be something like this:

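from lxml import etree
from docx import nsprefixes

W_NS = nsprefixes['w']

# `document` is the lxml element returned by opendocx(file_path)
page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': W_NS}
)
for br in page_br_elements:
    # lxml needs Clark notation ({namespace}tag); a 'w:t' tag name raises ValueError
    t = etree.Element('{%s}t' % W_NS, nsmap={'w': W_NS})
    t.text = '{{foobar}}'
    # Swap the break for the marker text so getdocumenttext() picks it up
    br.getparent().replace(br, t)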

I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.
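If the legacy docx module or lxml isn't available, the same idea can be sketched with only the standard library, since a .docx file is a zip archive whose main body lives in word/document.xml. This is an untested-against-your-files sketch for Python 3, not a drop-in replacement: the function name docx_text_with_breaks and the MARKER constant are made up here, and it only looks at <w:br w:type="page"/>, so page breaks produced some other way (for example by paragraph or section settings) would not be marked.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MARKER = "{{foobar}}"  # placeholder string; any text unique in your files works


def docx_text_with_breaks(docx_file, marker=MARKER):
    """Return the document text with `marker` at every hard page break.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    p_tag = "{%s}p" % W_NS
    t_tag = "{%s}t" % W_NS
    tab_tag = "{%s}tab" % W_NS
    br_tag = "{%s}br" % W_NS
    type_attr = "{%s}type" % W_NS

    paragraphs = []
    for para in root.iter(p_tag):
        pieces = []
        # Walk the paragraph in document order so the marker lands
        # where the break actually occurs relative to the text
        for el in para.iter():
            if el.tag == t_tag and el.text:
                pieces.append(el.text)
            elif el.tag == tab_tag:
                pieces.append("\t")
            elif el.tag == br_tag and el.get(type_attr) == "page":
                pieces.append(marker)
        if pieces:
            paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)
```

Something like docx_text_with_breaks("releases.docx") (a hypothetical file name) would then return one string that can be written to a .txt file as in the original script.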

Let me know how you go and I can tweak this if needed.
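Once the marker is in the text, separating the articles is a matter of splitting on it. Here is a rough sketch for Python 3, assuming the "{{foobar}}" marker from above; split_articles, write_articles, and the article_NNN.txt naming scheme are made up for illustration.

```python
import os

MARKER = "{{foobar}}"  # assumed to match the placeholder inserted earlier


def split_articles(text, marker=MARKER):
    """Split converted text on the marker, dropping empty chunks."""
    return [chunk.strip() for chunk in text.split(marker) if chunk.strip()]


def write_articles(text, out_dir, stem="article"):
    """Write each article to out_dir/<stem>_NNN.txt and return the paths."""
    paths = []
    for index, article in enumerate(split_articles(text), start=1):
        path = os.path.join(out_dir, "%s_%03d.txt" % (stem, index))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(article)
        paths.append(path)
    return paths
```

Dropping empty chunks means a trailing page break or two consecutive breaks won't produce empty files.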