I have several .docx files, each containing 300+ press releases of 1-2 pages each, that need to be separated into individual text files. The only consistent way to tell where one article ends and the next begins is that there is always a page break, and only a page break, between two articles.
However, I don't know how to find page breaks when converting the encompassing Word documents to text, and my current script loses the page break information during conversion.
I want to know how to preserve HARD page breaks when converting a .docx file to .txt. It doesn't matter to me what they look like in the text file, as long as they're uniquely identifiable when scanning the text file later.
Here is the script I am using to convert the docx files to txt:
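    # Assumes the legacy python-docx module (Python 2), which provides these helpers.
    from docx import opendocx, getdocumenttext

    def docx2txt(file_path):
        document = opendocx(file_path)
        # One string per paragraph; page breaks are not included in this list.
        paratextlist = getdocumenttext(document)
        newparatextlist = []
        for paratext in paratextlist:
            newparatextlist.append(paratext.encode("utf-8"))
        # Strip the ".docx" extension and write the .txt next to the source file.
        text_file = open("%s.txt" % file_path[:-5], "w")
        text_file.write('\n\n'.join(newparatextlist))
        text_file.close()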
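
For reference, a hard page break is stored in the document XML as a `<w:br w:type="page"/>` element inside a run, so one possible direction is to read `word/document.xml` directly and emit a marker wherever that element appears. The following is only a rough sketch using the standard library instead of the docx module above. The function name `docx_to_text_with_breaks` and the marker string are made up for illustration, and it assumes the breaks were inserted as explicit page breaks (not as "page break before" paragraph settings or section breaks, which it ignores).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Hypothetical marker; any string that cannot occur in the articles would do.
PAGE_BREAK_MARKER = "<<<PAGE_BREAK>>>"


def _tag(name):
    """Build a namespace-qualified tag or attribute name, e.g. {ns}p."""
    return "{%s}%s" % (W_NS, name)


def docx_to_text_with_breaks(source, marker=PAGE_BREAK_MARKER):
    """Return the document text with `marker` at each hard page break.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(_tag("p")):
        pieces = []
        for node in para.iter():
            if node.tag == _tag("t"):
                # Literal text of a run
                pieces.append(node.text or "")
            elif node.tag == _tag("br") and node.get(_tag("type")) == "page":
                # Explicit (hard) page break; soft rendered breaks are skipped
                pieces.append("\n" + marker + "\n")
        paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)
```

The resulting text could then be split into articles with something like `[a.strip() for a in text.split(PAGE_BREAK_MARKER)]`, writing each piece to its own file.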