How to identify page breaks using python-docx from

I have several .docx files, each containing 300+ press releases of 1-2 pages each, that need to be separated into individual text files. The only consistent way to tell where one article ends and the next begins is that there is always a page break, and nothing else, between two articles.

However, I don't know how to find page breaks when converting the enclosing Word documents to text, and the page break information is lost after the conversion with my current script.

I want to know how to preserve HARD page breaks when converting a .docx file to .txt. It doesn't matter to me what they look like in the text file, as long as they're uniquely identifiable when scanning the text file later.

Here is the script I am using to convert the docx files to txt:

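import io
import os

from docx import opendocx, getdocumenttext  # legacy python-docx (0.2.x) API


def docx2txt(file_path):
    """Write the paragraph text of a .docx file to a sibling .txt file."""
    document = opendocx(file_path)
    paratextlist = getdocumenttext(document)
    txt_path = os.path.splitext(file_path)[0] + ".txt"
    # io.open handles the UTF-8 encoding, so no per-paragraph encode() is needed
    with io.open(txt_path, "w", encoding="utf-8") as text_file:
        text_file.write(u"\n\n".join(paratextlist))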

Tags: python parsing docx page-break python-docx

1 Answer

冷血范

Post #2 · 2019-02-14 15:41

A hard page break will appear as a <w:br> element within a run element (<w:r>), something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>

So one approach would be to replace all those occurrences with a distinctive string of text, like maybe "{{foobar}}".

An implementation of that would be something like this:

from lxml import etree
from docx import nsprefixes

# `document` is the lxml element returned by opendocx(file_path)
W_NS = nsprefixes['w']

page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': W_NS}
)
for br in page_br_elements:
    # lxml rejects a prefixed tag such as 'w:t'; it needs '{namespace}t'
    t = etree.Element('{%s}t' % W_NS, nsmap={'w': W_NS})
    t.text = '{{foobar}}'
    br.addprevious(t)
    br.getparent().remove(br)

I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.
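If the legacy docx module isn't available, the same idea can be sketched with only the standard library. This is a different route from the lxml snippet, not a drop-in replacement: it reads word/document.xml straight out of the .docx zip archive and inserts the marker while collecting the text. The helper names (read_document_xml, paragraphs_with_markers, split_articles) and the SAMPLE string are made up for illustration, and the sketch assumes simple body paragraphs with no tables or text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MARKER = "{{foobar}}"  # any string that cannot occur in the articles


def read_document_xml(docx_path):
    """Return the raw XML of the main document part of a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def paragraphs_with_markers(document_xml):
    """Return one string per <w:p>, with MARKER at each hard page break."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        parts = []
        for node in p.iter():
            if node.tag == "{%s}t" % W_NS:
                parts.append(node.text or "")
            elif (node.tag == "{%s}br" % W_NS
                  and node.get("{%s}type" % W_NS) == "page"):
                parts.append(MARKER)
        paragraphs.append("".join(parts))
    return paragraphs


def split_articles(paragraphs):
    """Join the paragraphs and cut the text at every MARKER."""
    text = "\n\n".join(paragraphs)
    return [chunk.strip() for chunk in text.split(MARKER) if chunk.strip()]


# Hypothetical two-article document, shaped like the XML shown above
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>first release</w:t><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>second release</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
articles = split_articles(paragraphs_with_markers(SAMPLE))
```

For a real file you would pass read_document_xml("releases.docx") (a placeholder file name) instead of SAMPLE, then write each entry of articles to its own .txt file.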

Let me know how you go and I can tweak this if needed.
