I have been trying to split a .docx document into multiple documents based on predefined criteria. The following is my approach for splitting it into paragraphs:
// Needs java.io.*, java.util.List and org.apache.poi.xwpf.usermodel.* on the import list.
try (FileInputStream in = new FileInputStream(file);
     XWPFDocument doc = new XWPFDocument(in)) {
    List<XWPFParagraph> paragraphs = doc.getParagraphs();
    for (int idx = 0; idx < paragraphs.size(); idx++) {
        String fullPath = String.format("./content/output/%1$s_%2$s_%3$04d.docx", FileUtils.getFileName(file), getName(), idx);
        // One output document per paragraph; try-with-resources closes the document and the stream.
        try (XWPFDocument outputDocument = new XWPFDocument();
             FileOutputStream outputStream = new FileOutputStream(fullPath)) {
            createParagraphInAnotherDocument(outputDocument, paragraphs.get(idx).getText());
            outputDocument.write(outputStream);
        }
    }
    // The source document is closed once, after the loop, instead of on every iteration.
} catch (IOException e) {
    e.printStackTrace();
}
While I am able to extract paragraphs with the code above, I can't find a way to extract pages. My understanding is that pages in Word are a rendering concern: they are computed at runtime by the Word application rather than stored in the file.
As far as I can see, the only way to do this is to interrogate the document model of the Word file and then determine how many paragraphs there are on each page. Below is a possible solution to the problem (it only works if the pages are explicitly separated by page breaks).
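As a rough illustration of the grouping step only, here is a minimal sketch in plain Java. The class name `PageSplitter` and the method `splitIntoPages` are hypothetical names introduced for this example, not part of Apache POI. The sketch assumes you can supply a predicate that says whether a given paragraph starts a new page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Apache POI.
public class PageSplitter {

    /**
     * Groups items into "pages". A new page starts at every item for which
     * startsNewPage returns true; items before the first break form the first page.
     */
    public static <T> List<List<T>> splitIntoPages(List<T> items, Predicate<? super T> startsNewPage) {
        List<List<T>> pages = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T item : items) {
            // Close the page collected so far when a break is found,
            // unless it is still empty (for example a break on the very first item).
            if (startsNewPage.test(item) && !current.isEmpty()) {
                pages.add(current);
                current = new ArrayList<>();
            }
            current.add(item);
        }
        // Flush the last page.
        if (!current.isEmpty()) {
            pages.add(current);
        }
        return pages;
    }
}
```

With POI, one option would be to call `PageSplitter.splitIntoPages(doc.getParagraphs(), XWPFParagraph::isPageBreak)` and then write each group to its own `XWPFDocument`. Note that `isPageBreak()` only reflects the "page break before" paragraph setting. Manual breaks are stored as `<w:br w:type="page"/>` elements inside runs, so detecting them would need a predicate that inspects each run instead.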