Finding implicit page break in word document using

Published 2019-07-29 22:34

Question:

I need to extract the first-page content of a Word document. Looking at the OpenXML of a WordprocessingML document, I see elements such as <w:lastRenderedPageBreak /> and <w:br w:type="page" />. It seems that <w:br w:type="page" /> occurs when the user enters a hard page break. What I don't understand is in which cases <w:lastRenderedPageBreak /> occurs: it appears for some implicit page breaks but not all. For example, I typed some text and pressed Enter several times until the cursor moved to the next page, then kept pressing Enter several more times on the new page. This is what I get:

    **DOCUMENT.XML**
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0061403F" w:rsidRDefault="0061403F" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />   <!-- page break -->
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>

As you can see, even though the cursor moves to the next page as I press Enter, nothing in document.xml (in the extracted Word document folder) records this. Can someone help me find the implicit page break in the Word document so that I can extract the content of its first page? If there is no way to detect a particular page's content in OpenXML, how do PDF conversion tools work, where each page of the Word document becomes a page in the PDF?

Please do not suggest APIs like POI, which have no provision for extracting the content of a particular page.

Edit: The reason I need to find the implicit page break is that my task involves extracting the cover image of a Word document. The heuristic I'm following is "if the first page of the document contains only an image, then it is a cover image; otherwise there is no cover image". So I need to get the content of the first page alone and check whether it contains only an image. How can I do that?

Answer 1:

The short answer is that it's not possible to do what you want by examining the XML. The page rendering engine of Word (or a PDF converter) is what determines where the page breaks. The XML simply describes the content to be "flowed" by the rendering engine.
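
That said, the two markers mentioned in the question can still be scanned for as a best-effort guess. The following is only a rough sketch in Python (standard library only), not a reliable page detector: it walks the body in document order and stops at the first <w:br w:type="page" /> or <w:lastRenderedPageBreak />. The function names (`read_document_xml`, `first_page_summary`) are made up for illustration, and the sketch assumes that the cached <w:lastRenderedPageBreak /> hints, when present at all, still match the current layout. It does not handle other explicit breaks such as <w:pageBreakBefore /> or section breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/document.xml")


def is_page_break(elem):
    """True for a hard page break or a cached rendered-break hint."""
    if elem.tag == W + "lastRenderedPageBreak":
        return True
    return elem.tag == W + "br" and elem.get(W + "type") == "page"


def first_page_summary(document_xml):
    """Collect content up to the first break marker, in document order.

    Returns (text, has_drawing, break_found). When break_found is False,
    no marker exists and the result covers the whole body, so it says
    nothing about where the first page really ends.
    """
    root = ET.fromstring(document_xml)
    body = root.find(W + "body")
    text_parts = []
    has_drawing = False
    for elem in body.iter():
        if is_page_break(elem):
            return "".join(text_parts), has_drawing, True
        if elem.tag == W + "t" and elem.text:
            text_parts.append(elem.text)
        elif elem.tag in (W + "drawing", W + "pict"):
            # Inline or legacy VML picture container
            has_drawing = True
    return "".join(text_parts), has_drawing, False
```

For the cover-image heuristic, a candidate cover page would be one where `break_found` is true, `has_drawing` is true, and the text is empty after stripping whitespace. For a document like the one in the question, where the break comes only from empty paragraphs, `break_found` is false and the XML alone cannot tell you where page one ends.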