PDFBox: How to “flatten” a PDF-form?

2019-01-17 23:57发布

站内文章 / Java

30 0

问题:

How do I "flatten" a PDF-form (remove the form-field but keep the text of the field) with PDFBox?

Same question was answered here:

a quick way to do this, is to remove the fields from the acrofrom.

For this you just need to get the document catalog, then the acroform and then remove all fields from this acroform.

The graphical representation is linked with the annotation and stay in the document.

So I wrote this code:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class PdfBoxTest {
    public void test() throws Exception {
        PDDocument pdDoc = PDDocument.load(new File("E:\\Form-Test.pdf"));
        PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
        PDAcroForm acroForm = pdCatalog.getAcroForm();

        if (acroForm == null) {
            System.out.println("No form-field --> stop");
            return;
        }

        @SuppressWarnings("unchecked")
        List<PDField> fields = acroForm.getFields();

        // set the text in the form-field <-- does work
        for (PDField field : fields) {
            if (field.getFullyQualifiedName().equals("formfield1")) {
                field.setValue("Test-String");
            }
        }

        // remove form-field but keep text ???
        // acroForm.getFields().clear();         <-- does not work
        // acroForm.setFields(null);             <-- does not work
        // acroForm.setFields(new ArrayList());  <-- does not work
        // ???

        pdDoc.save("E:\\Form-Test-Result.pdf");
        pdDoc.close();
    }
}

回答1:

With PDFBox 2 it's now possible to "flatten" a PDF-form easily with this new API method: PDAcroForm.flatten().

Simplified code with an example call of this method:

//Load the document
PDDocument pDDocument = PDDocument.load(new File("E:\\Form-Test.pdf"));    
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();

//Fill the document
...

//Flatten the document
pDAcroForm.flatten();

//Save the document
pDDocument.save("E:\\Form-Test-Result.pdf");
pDDocument.close();

Note: dynamic XFA forms cannot be flatten.

For migration from PDFBox 1.* to 2.0, take a look at the official migration guide.

回答2:

This works for sure - I've ran into this problem, debugged all-night, but finally figured out how to do this :)

This is assuming that you have capability to edit the PDF in some way/have some control over the PDF.

First, edit the forms using Acrobat Pro. Make them hidden and read-only.

Then you need to use two libraries: PDFBox and PDFClown.

PDFBox removes the thing that tells Adobe Reader that it's a form; PDFClown removes the actual field. PDFClown must be done first, then PDFBox (in that order. The other way around doesn't work).

Single field example code:

// PDF Clown code
File file = new File("Some file path"); 
Document document = file.getDocument();
Form form = file.getDocument.getForm();
Fields fields = form.getFields();
Field field = fields.get("some_field_name");

PageStamper stamper = new PageStamper(); 
FieldWidgets widgets = field.getWidgets();
Widget widget = widgets.get(0); // Generally is 0.. experiment to figure out
stamper.setPage(widget.getPage());

// Write text using text form field position as pivot.
PrimitiveComposer composer = stamper.getForeground();
Font font = font.get(document, "some_path"); 
composer.setFont(font, 10); 
double xCoordinate = widget.getBox().getX();
double yCoordinate = widget.getBox().getY(); 
composer.showText("text i want to display", new Point2D.Double(xCoordinate, yCoordinate)); 

// Actually delete the form field!
field.delete();
stamper.flush(); 

// Create new buffer to output to... 
Buffer buffer = new Buffer();
file.save(buffer, SerializationModeEnum.Standard); 
byte[] bytes = buffer.toByteArray(); 

// PDFBox code
InputStream pdfInput = new ByteArrayInputStream(bytes);
PDDocument pdfDocument = PDDocument.load(pdfInput);

// Tell Adobe we don't have forms anymore.
PDDocumentCatalog pdCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = pdCatalog.getAcroForm();
COSDictionary acroFormDict = acroForm.getDictionary();
COSArray cosFields = (COSArray) acroFormDict.getDictionaryObject("Fields");
cosFields.clear();

// Phew. Finally.
pdfDocument.save("Some file path");

Probably some typos here and there, but this should be enough to get the gist :)

回答3:

setReadOnly did work for me as shown below -

   @SuppressWarnings("unchecked")
    List<PDField> fields = acroForm.getFields();
    for (PDField field : fields) {
        if (field.getFullyQualifiedName().equals("formfield1")) {
            field.setReadOnly(true);
        }
    }

回答4:

After reading about pdf reference guide, I have discovered that you can quite easily set read-only mode for AcroForm fields by adding "Ff" key (Field flags) with value 1. This is what documentation stands about that:

If set, the user may not change the value of the field. Any associated widget annotations will not interact with the user; that is, they will not respond to mouse clicks or change their appearance in response to mouse motions. This flag is useful for fields whose values are computed or imported from a database.

so the code could look like that (using pdfbox lib):

 public static void makeAllWidgetsReadOnly(PDDocument pdDoc) throws IOException {

    PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();

    PDAcroForm form = catalog.getAcroForm();

    List<PDField> acroFormFields = form.getFields();

    System.out.println(String.format("found %d acroFrom fields", acroFormFields.size()));

    for(PDField field: acroFormFields) {
        makeAcroFieldReadOnly(field);
    }
}

private static void makeAcroFieldReadOnly(PDField field) {

    field.getDictionary().setInt("Ff",1);

}

回答5:

Solution to flattening acroform AND retaining the form field values using pdfBox:

see solution at https://mail-archives.apache.org/mod_mbox/pdfbox-users/201604.mbox/%3C3BC7E352-9447-4458-AAC3-5A9B70B4CCAA@fileaffairs.de%3E

The solution that worked for me with pdfbox 2.0.1:

File myFile = new File("myFile.pdf");
PDDocument pdDoc = PDDocument.load(myFile);
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();

// set the NeedAppearances flag to false
pdAcroForm.setNeedAppearances(false);


field.setValue("new-value");

pdAcroForm.flatten();
pdDoc.save("myFlattenedFile.pdf");
pdDoc.close();

I didn't need to do the 2 extra steps in the above solution link:

// correct the missing page link for the annotations
// Add the missing resources to the form

I created my pdf form in OpenOffice 4.1.1 and exported to pdf. The 2 items selected in the OpenOffice export dialogue were:

selected "create Pdf Form"
Submit format of "PDF" - I found this gave smaller pdf file size than selecting "FDF" but still operated as a pdf form.

Using PdfBox I populated the form fields and created a flattened pdf file that removed the form fields but retained the form field values.

回答6:

In order to really "flatten" an acrobat form field there seems to be much more to do than at the first glance. After examining the PDF standard I managed to achieve real flatening in three steps:

save field value
remove widgets
remove form field

All three steps can be done with pdfbox (I used 1.8.5). Below I will sketch how I did it. A very helpful tool in order to understand whats going on is the PDF Debugger.

Save the field

This is the most complicated step of the three.

In order to save the field's value you have to save its content to the pdf's content for each of the field's widgets. Easiest way to do so is drawing each widget's appearance to the widget's page.

void saveFieldValue( PDField field ) throws IOException
{
    PDDocument document = getDocument( field );
    // see PDField.getWidget()
    for( PDAnnotationWidget widget : getWidgets( field ) )
    {
        PDPage parentPage = getPage( widget );

        try (PDPageContentStream contentStream = new PDPageContentStream( document, parentPage, true, true ))
        {
            writeContent( contentStream, widget );
        }
    }
}

void writeContent( PDPageContentStream contentStream, PDAnnotationWidget widget )
        throws IOException
{
    PDAppearanceStream appearanceStream = getAppearanceStream( widget );
    PDXObject xobject = new PDXObjectForm( appearanceStream.getStream() );
    AffineTransform transformation = getPositioningTransformation( widget.getRectangle() );

    contentStream.drawXObject( xobject, transformation );
}

The appearance is an XObject stream containing all of the widget's content (value, font, size, rotation, etc.). You simply need to place it at the right position on the page which you can extract from the widget's rectangle.

Remove widgets

As noted above each field may have multiple widgets. A widget takes care of how a form field can be edited, triggers, displaying when not editing and such stuff.

In order to remove one you have to remove it from its page's annotations.

void removeWidget( PDAnnotationWidget widget ) throws IOException
{
    PDPage widgetPage = getPage( widget );
    List<PDAnnotation> annotations = widgetPage.getAnnotations();
    PDAnnotation deleteCandidate = getMatchingCOSObjectable( annotations, widget );
    if( deleteCandidate != null && annotations.remove( deleteCandidate ) )
        widgetPage.setAnnotations( annotations );
}

Note that the annotations may not contain the exact PDAnnotationWidget since it's a kind of a wrapper. You have to remove the one with matching COSObject.

Remove form field

As final step you remove the form field itself. This is not very different to the other posts above.

void removeFormfield( PDField field ) throws IOException
{
    PDAcroForm acroForm = field.getAcroForm();
    List<PDField> acroFields = acroForm.getFields();
    List<PDField> removeCandidates = getFields( acroFields, field.getPartialName() );
    if( removeAll( acroFields, removeCandidates ) )
        acroForm.setFields( acroFields );
}

Note that I used a custom removeAll method here since the removeCandidates.removeAll() didn't work as expected for me.

Sorry that I cannot provide all the code here but with the above you should be able to write it yourself.

回答7:

This is the code I came up with after synthesizing all of the answers I could find on the subject. This handles flattening text boxes, combos, lists, checkboxes, and radios:

public static void flattenPDF (PDDocument doc) throws IOException {

    //
    //  find the fields and their kids (widgets) on the input document
    //  (each child widget represents an appearance of the field data on the page, there may be multiple appearances)
    //
    PDDocumentCatalog catalog = doc.getDocumentCatalog();
    PDAcroForm form = catalog.getAcroForm();
    List<PDField> tmpfields = form.getFields();
    PDResources formresources = form.getDefaultResources();
    Map formfonts = formresources.getFonts();
    PDAnnotation ann;

    //
    // for each input document page convert the field annotations on the page into
    // content stream
    //
    List<PDPage> pages = catalog.getAllPages();
    Iterator<PDPage> pageiterator = pages.iterator();
    while (pageiterator.hasNext()) {
        //
        // get next page from input document
        //
        PDPage page = pageiterator.next();

        //
        // add the fonts from the input form to this pages resources
        // so the field values will display in the proper font
        //
        PDResources pageResources = page.getResources();
        Map pageFonts = pageResources.getFonts();
        pageFonts.putAll(formfonts);
        pageResources.setFonts(pageFonts);

        //
        // Create a content stream for the page for appending
        //
        PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);

        //
        // Find the appearance widgets for all fields on the input page and insert them into content stream of the page
        //
        for (PDField tmpfield : tmpfields) {
            List widgets = tmpfield.getKids();
            if(widgets == null) {
                widgets = new ArrayList();
                widgets.add(tmpfield.getWidget());
            }
            Iterator<COSObjectable> widgetiterator = widgets.iterator();
            while (widgetiterator.hasNext()) {
                COSObjectable next = widgetiterator.next();
                if (next instanceof PDField) {
                    PDField foundfield = (PDField) next;
                    ann = foundfield.getWidget();
                } else {
                    ann = (PDAnnotation) next;
                }
                if (ann.getPage().equals(page)) {
                    COSDictionary dict = ann.getDictionary();
                    if (dict != null) {
                        if(tmpfield instanceof PDVariableText || tmpfield instanceof PDPushButton) {
                            COSDictionary ap = (COSDictionary) dict.getDictionaryObject("AP");
                            if (ap != null) {

                                contentStream.appendRawCommands("q\n");
                                COSArray rectarray = (COSArray) dict.getDictionaryObject("Rect");
                                if (rectarray != null) {
                                    float[] rect = rectarray.toFloatArray();
                                    String s = " 1 0 0 1  " + Float.toString(rect[0]) + " " + Float.toString(rect[1]) + " cm\n";

                                    contentStream.appendRawCommands(s);
                                }
                                COSStream stream = (COSStream) ap.getDictionaryObject("N");
                                if (stream != null) {
                                    InputStream ioStream = stream.getUnfilteredStream();
                                    ByteArrayOutputStream byteArray = new ByteArrayOutputStream();
                                    byte[] buffer = new byte[4096];
                                    int amountRead = 0;
                                    while ((amountRead = ioStream.read(buffer, 0, buffer.length)) != -1) {
                                        byteArray.write(buffer, 0, amountRead);
                                    }

                                    contentStream.appendRawCommands(byteArray.toString() + "\n");
                                }

                                contentStream.appendRawCommands("Q\n");
                            }
                        } else if (tmpfield instanceof PDChoiceButton) {
                            COSDictionary ap = (COSDictionary) dict.getDictionaryObject("AP");
                            if(ap != null) {
                                contentStream.appendRawCommands("q\n");
                                COSArray rectarray = (COSArray) dict.getDictionaryObject("Rect");
                                if (rectarray != null) {
                                    float[] rect = rectarray.toFloatArray();
                                    String s = " 1 0 0 1  " + Float.toString(rect[0]) + " " + Float.toString(rect[1]) + " cm\n";

                                    contentStream.appendRawCommands(s);
                                }

                                COSName cbValue = (COSName) dict.getDictionaryObject(COSName.AS);
                                COSDictionary d = (COSDictionary) ap.getDictionaryObject(COSName.D);
                                if (d != null) {
                                    COSStream stream = (COSStream) d.getDictionaryObject(cbValue);
                                    if(stream != null) {
                                        InputStream ioStream = stream.getUnfilteredStream();
                                        ByteArrayOutputStream byteArray = new ByteArrayOutputStream();
                                        byte[] buffer = new byte[4096];
                                        int amountRead = 0;
                                        while ((amountRead = ioStream.read(buffer, 0, buffer.length)) != -1) {
                                            byteArray.write(buffer, 0, amountRead);
                                        }

                                        if (!(tmpfield instanceof PDCheckbox)){
                                            contentStream.appendRawCommands(byteArray.toString() + "\n");
                                        }
                                    }
                                }

                                COSDictionary n = (COSDictionary) ap.getDictionaryObject(COSName.N);
                                if (n != null) {
                                    COSStream stream = (COSStream) n.getDictionaryObject(cbValue);
                                    if(stream != null) {
                                        InputStream ioStream = stream.getUnfilteredStream();
                                        ByteArrayOutputStream byteArray = new ByteArrayOutputStream();
                                        byte[] buffer = new byte[4096];
                                        int amountRead = 0;
                                        while ((amountRead = ioStream.read(buffer, 0, buffer.length)) != -1) {
                                            byteArray.write(buffer, 0, amountRead);
                                        }

                                        contentStream.appendRawCommands(byteArray.toString() + "\n");
                                    }
                                }

                                contentStream.appendRawCommands("Q\n");
                            }
                        }
                    }
                }
            }
        }

        // delete any field widget annotations and write it all to the page
        // leave other annotations on the page
        COSArrayList newanns = new COSArrayList();
        List anns = page.getAnnotations();
        ListIterator annotiterator = anns.listIterator();
        while (annotiterator.hasNext()) {
            COSObjectable next = (COSObjectable) annotiterator.next();
            if (!(next instanceof PDAnnotationWidget)) {
                newanns.add(next);
            }
        }

        page.setAnnotations(newanns);
        contentStream.close();
    }

    //
    // Delete all fields from the form and their widgets (kids)
    //
    for (PDField tmpfield : tmpfields) {
        List kids = tmpfield.getKids();
        if(kids != null) kids.clear();
    }

    tmpfields.clear();

    // Tell Adobe we don't have forms anymore.
    PDDocumentCatalog pdCatalog = doc.getDocumentCatalog();
    PDAcroForm acroForm = pdCatalog.getAcroForm();
    COSDictionary acroFormDict = acroForm.getDictionary();
    COSArray cosFields = (COSArray) acroFormDict.getDictionaryObject("Fields");
    cosFields.clear();
}

Full class here: https://gist.github.com/jribble/beddf7620536939f88db

回答8:

This is the answer of Thomas, from the PDFBox-Mailinglist:

You will need to get the Fields over the COSDictionary. Try this code...

PDDocument pdDoc = PDDocument.load(new File("E:\\Form-Test.pdf"));
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm acroForm = pdCatalog.getAcroForm();

COSDictionary acroFormDict = acroForm.getDictionary();
COSArray fields = acroFormDict.getDictionaryObject("Fields");
fields.clear();

回答9:

I don't have enough points to comment but SJohnson's response of setting the field to read only worked perfectly for me. I am using something like this with PDFBox:

private void setFieldValueAndFlatten(PDAcroForm form, String fieldName, String fieldValue) throws IOException {
    PDField field = form.getField(fieldName);
    if(field != null){
        field.setValue(fieldValue);
        field.setReadonly(true);
    }
}

This will write your field value and then when you open the PDF after saving it will have your value and not be editable.

回答10:

I thought I'd share our approach that worked with PDFBox 2+.

We've used the PDAcroForm.flatten() method.

The fields needed some preprocessing and most importantly the nested field structure had to be traversed and DV and V checked for values.

Finally what worked was the following:

private static void flattenPDF(String src, String dst) throws IOException {
    PDDocument doc = PDDocument.load(new File(src));

    PDDocumentCatalog catalog = doc.getDocumentCatalog();
    PDAcroForm acroForm = catalog.getAcroForm();
    PDResources resources = new PDResources();
    acroForm.setDefaultResources(resources);

    List<PDField> fields = new ArrayList<>(acroForm.getFields());
    processFields(fields, resources);
    acroForm.flatten();

    doc.save(dst);
    doc.close();
}

private static void processFields(List<PDField> fields, PDResources resources) {
    fields.stream().forEach(f -> {
        f.setReadOnly(true);
        COSDictionary cosObject = f.getCOSObject();
        String value = cosObject.getString(COSName.DV) == null ?
                       cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
        System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
        try {
            f.setValue(value);
        } catch (IOException e) {
            if (e.getMessage().matches("Could not find font: /.*")) {
                String fontName = e.getMessage().replaceAll("^[^/]*/", "");
                System.out.println("Adding fallback font for: " + fontName);
                resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
                try {
                    f.setValue(value);
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            } else {
                e.printStackTrace();
            }
        }
        if (f instanceof PDNonTerminalField) {
            processFields(((PDNonTerminalField) f).getChildren(), resources);
        }
    });
}