pdfbox: how to clone a page

Posted 2019-05-03 07:55

Using Apache PDFBox, I am editing an existing document and I would like to take one page from that document and simply clone it, copying whatever elements it contains. As an additional twist, I would like to get a reference to all the PDFields for any form fields in this newly cloned page. Here's the code I tried so far:

            PDPage newPage = new PDPage(lastPage.getCOSDictionary());
            PDFCloneUtility cloner = new PDFCloneUtility(pdfDoc);
            pdfDoc.addPage(newPage);
            cloner.cloneMerge(lastPage, newPage);

            // there doesn't seem to be an API to read the fields from the page, need to filter them out from the document.
            List<PDField> newFields = readPdfFields(pdfDoc);
            Iterator<PDField> i = newFields.iterator();
            while (i.hasNext()) {
                if (i.next().getWidget().getPage() != newPage)
                    i.remove();
            }

readPdfFields is a helper method I wrote to get all the fields in a document using the AcroForm.

But this code seems to lead to some kind of crash/hang state in my JVM - I haven't been able to debug exactly what's happening but I'm guessing this is not actually the right way to clone a page. What is?

Tags: java pdfbox
1 answer
狗以群分
Post #2 · 2019-05-03 08:36

The least resource-intensive way to clone a page is to make a shallow copy of the corresponding page dictionary:

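// PDFBox 1.8.x API; `file` and `outfile` are the input and output paths.
PDDocument doc = PDDocument.load( file );

List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();

// Shallow copy of the page dictionary: all entries are shared with the original page.
PDPage page = allPages.get(0);
COSDictionary pageDict = page.getCOSDictionary();
COSDictionary newPageDict = new COSDictionary(pageDict);

// Drop the annotations; their page references still point to the original page.
newPageDict.removeItem(COSName.ANNOTS);

PDPage newPage = new PDPage(newPageDict);
doc.addPage(newPage);

doc.save( outfile );
doc.close();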

I explicitly removed the annotations (form fields etc.) from the copy because an annotation has a reference pointing back to its page, and in the copied page that reference would still point to the original page, which is wrong.

Thus, if you want the annotations to come along in a clean way, you have to create shallow copies of the annotations array and all contained annotation dictionaries, too, and replace the page reference therein.
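The clean variant described above can be modelled without PDFBox. This is only a minimal sketch: plain `java.util` maps and lists stand in for `COSDictionary` and `COSArray`, the names `PageCloneModel` and `clonePage` are made up for illustration, and the keys `Annots` and `P` mirror the PDF entries for a page's annotation array and an annotation's page back-reference. It is not PDFBox code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Library-free model: Map stands in for COSDictionary, List for COSArray.
@SuppressWarnings("unchecked")
class PageCloneModel {

    // Shallow-copies a page, its annotation array and every annotation in it,
    // and points each copied annotation's "P" entry at the new page.
    static Map<String, Object> clonePage(Map<String, Object> page) {
        Map<String, Object> newPage = new HashMap<>(page); // shallow copy of the page
        List<Map<String, Object>> annots = (List<Map<String, Object>>) page.get("Annots");
        if (annots != null) {
            List<Map<String, Object>> newAnnots = new ArrayList<>();
            for (Map<String, Object> annot : annots) {
                Map<String, Object> copy = new HashMap<>(annot); // shallow copy of the annotation
                copy.put("P", newPage);                          // replace the page reference
                newAnnots.add(copy);
            }
            newPage.put("Annots", newAnnots);
        }
        return newPage;
    }

    public static void main(String[] args) {
        Map<String, Object> page = new HashMap<>();
        Map<String, Object> annot = new HashMap<>();
        annot.put("T", "name");
        annot.put("P", page); // back-reference to the original page
        List<Map<String, Object>> annots = new ArrayList<>();
        annots.add(annot);
        page.put("Annots", annots);

        Map<String, Object> copy = clonePage(page);
        List<Map<String, Object>> copiedAnnots = (List<Map<String, Object>>) copy.get("Annots");

        // Compare by identity only: the structures are cyclic, so equals/toString would recurse.
        System.out.println("copy points to new page: " + (copiedAnnots.get(0).get("P") == copy));
        System.out.println("original untouched: " + (annot.get("P") == page));
    }
}
```

In real PDFBox code the same three steps (copy the page dictionary, copy the annotations array, copy each annotation dictionary and reset its page entry) would be done on the COS objects directly.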

Most PDF readers would not mind, though, if the page references are incorrect. For a dirty solution, therefore, you could simply leave the annotations in the page dictionary. But who wants to be dirty... ;)

If you want to additionally change some parts of the new or the old page, you obviously also have to copy the respective PDF objects before manipulating them.
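Why the copy is needed can be shown with a small library-free model. In this sketch a `java.util.Map` again stands in for `COSDictionary`; `CopyBeforeEdit`, `detach` and the `Resources`/`Font` values are illustrative names only, not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

// Library-free model: Map stands in for COSDictionary.
class CopyBeforeEdit {

    // Gives `page` its own shallow copy of the nested dictionary stored under `key`
    // and returns that copy, so later edits do not touch the shared original.
    // Assumes the key is present and holds a dictionary.
    @SuppressWarnings("unchecked")
    static Map<String, Object> detach(Map<String, Object> page, String key) {
        Map<String, Object> shared = (Map<String, Object>) page.get(key);
        Map<String, Object> own = new HashMap<>(shared);
        page.put(key, own);
        return own;
    }

    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<>();
        resources.put("Font", "F1");
        Map<String, Object> page = new HashMap<>();
        page.put("Resources", resources);

        // After a shallow copy both pages share the same Resources object.
        Map<String, Object> newPage = new HashMap<>(page);

        // Copy the nested object first, then change only the copy.
        detach(newPage, "Resources").put("Font", "F2");

        System.out.println(resources.get("Font")); // prints "F1": the original page is unchanged
    }
}
```

Without the `detach` step, the `put` would have changed the object that both pages refer to.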

Some other remarks:

Your original page cloning looks odd to me. You add the identical page dictionary to the document a second time (duplicate entries in the page tree are ignored, I think) and then merge these two identical page objects.

I assume PDFCloneUtility is meant for cloning between different documents, not within the same one; in any case, merging a dictionary into itself cannot be expected to work.

You wrote: "I would like to get a reference to all the PDFields for any form fields in this newly cloned page"

As the fields have the same name, they are identical!

Fields in PDF are abstract fields which can have many appearances spread over the document. The same name implies the same field.

A field appearing on some page means that there is an annotation representing that field on the page. To make things more complicated, field dictionary and annotation dictionary can be merged for fields with one appearance only.

Thus, depending on your requirements you will first have to decide whether you want to work with fields or with field annotations.
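The distinction between a field and its annotations can be sketched with a toy model. Everything here is hypothetical and independent of PDFBox: a field is identified by its name only, and `widgetPages` maps each field name to the indices of the pages that carry an annotation for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: one abstract field per name, any number of annotations per field.
class FieldWidgetModel {

    // Returns the names of all fields that have an annotation on the given page.
    static List<String> fieldsOnPage(Map<String, List<Integer>> widgetPages, int pageIndex) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : widgetPages.entrySet()) {
            if (entry.getValue().contains(pageIndex)) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> widgetPages = new LinkedHashMap<>();
        widgetPages.put("name", new ArrayList<>(Arrays.asList(0)));

        // Cloning page 0 as page 1 together with its annotation adds a second
        // appearance of the field "name"; it does not create a second field.
        widgetPages.get("name").add(1);

        System.out.println(widgetPages.size());           // prints 1: still one field
        System.out.println(fieldsOnPage(widgetPages, 1)); // prints [name]
    }
}
```

If the cloned page is supposed to carry independent fields, the copied annotations would need new field names; if it should only show the same values again, working with the annotations is enough.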
