PDFBox: Remove a single field from PDF

Posted 2019-12-16 21:50

Question:

The simplest way I can describe the problem is that we need to use PDFBox to remove only one field (e.g. a Credit Card Number) from a PDF that is sent to us from HelloSign.

  1. The data in question will always be on the last page, and it will always be at the same coordinates in the page.
  2. The data needs to be completely removed from the PDF. We can't simply change the font to white or draw a box on top as it will still be selectable, and thus, can be copied.
  3. Only that one field can be removed. We still need the other fields and the signatures.
  4. I've created a sample document and uploaded it to Dropbox: input.pdf
  5. For the sake of this question, let's assume the field to be removed is the Street Address from the file I uploaded. Not the City, State, Zip, Signatures, or Dates. (In real life it will be a sensitive data field like a Credit Card Number or SSN.)

I'm putting a loooong-winded explanation of the problem and what I've tried so far in the first comment below.

Answer 1:

The code in this answer may look fairly generic: it first builds a map of the fields in the document and then allows deleting any combination of the text fields. Please be aware, though, that it has been developed with only the single example PDF from this question. Thus, I cannot be sure whether I correctly understood the way fields are marked for/by HelloSign, and in particular the way HelloSign fills these fields.

This answer presents two classes, one which analyzes a HelloSign form and one which manipulates it by clearing selected fields; the latter one relies on the information gathered by the former. Both classes are built upon the PDFBox PDFTextStripper utility class.

The code has been developed for the current PDFBox development version 2.1.0-SNAPSHOT. Most likely it works with all 2.0.x versions, too.

HelloSignAnalyzer

This class analyzes the given PDDocument looking for the sequences

  • [$varname ] which appear to define placeholders for placing form field contents, and
  • [def:$varname|type|req|signer|display|label] which appear to define properties of the placeholders.

It creates a collection of HelloSignField instances each of which describes such a placeholder. They also contain the value of the respective field if text could be found located over the placeholder.

Furthermore, it stores the name of the last xobject drawn on the page, which in the case of the sample document is where HelloSign draws its field contents.
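To make the two bracket syntaxes concrete before the full class, here is a minimal, PDFBox-free sketch that picks them out of a plain string. The class name PlaceholderSketch and its regular expressions are hypothetical and not part of the answer's code, and the sample string is reconstructed from the analyzer output shown further down, not copied from the PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer's classes.
class PlaceholderSketch
{
    // [$varname   ] : the name ends at the first whitespace, '|' or ']'
    static final Pattern PLACEHOLDER = Pattern.compile("\\[\\$([^\\s\\]|]+)[^\\]]*\\]");
    // [def:$varname|type|req|signer|display|label]
    static final Pattern DEFINITION = Pattern.compile("\\[def:\\$([^\\]]*)\\]");

    static List<String> describe(String text)
    {
        List<String> result = new ArrayList<>();

        Matcher placeholder = PLACEHOLDER.matcher(text);
        while (placeholder.find())
            result.add("placeholder " + placeholder.group(1));

        Matcher definition = DEFINITION.matcher(text);
        while (definition.find())
        {
            // Same splitting rule as the analyzer: pieces are separated by '|'.
            String[] pieces = definition.group(1).split("\\|");
            String type = pieces.length > 1 ? pieces[1] : null;
            // Anything other than "req" in the third piece is treated as optional.
            boolean optional = pieces.length > 2 && !"req".equals(pieces[2]);
            result.add("definition " + pieces[0] + " type=" + type + " optional=" + optional);
        }
        return result;
    }

    public static void main(String[] args)
    {
        String sample = "[$var1001        ] [def:$var1001|text|req|signer1|Address|address1]";
        for (String line : describe(sample))
            System.out.println(line);
    }
}
```

The real analyzer does the same job with indexOf calls because it also needs the character index of each bracket to look up the matching TextPosition.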

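// Imports use the package names of PDFBox 2.0.x.
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class HelloSignAnalyzer extends PDFTextStripper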
{
    public class HelloSignField
    {
        public String getName()
        { return name; }
        public String getValue()
        { return value; }
        public float getX()
        { return x; }
        public float getY()
        { return y; }
        public float getWidth()
        { return width; }
        public String getType()
        { return type; }
        public boolean isOptional()
        { return optional; }
        public String getSigner()
        { return signer; }
        public String getDisplay()
        { return display; }
        public String getLabel()
        { return label; }
        public float getLastX()
        { return lastX; }

        String name = null;
        String value = "";
        float x = 0, y = 0, width = 0;
        String type = null;
        boolean optional = false;
        String signer = null;
        String display = null;
        String label = null;

        float lastX = 0;

        @Override
        public String toString()
        {
            return String.format("[Name: '%s'; Value: `%s` Position: %s, %s; Width: %s; Type: '%s'; Optional: %s; Signer: '%s'; Display: '%s', Label: '%s']",
                    name, value, x, y, width, type, optional, signer, display, label);
        }

        void checkForValue(List<TextPosition> textPositions)
        {
            for (TextPosition textPosition : textPositions)
            {
                if (inField(textPosition))
                {
                    float textX = textPosition.getTextMatrix().getTranslateX();
                    if (textX > lastX + textPosition.getWidthOfSpace() / 2 && value.length() > 0)
                        value += " ";
                    value += textPosition.getUnicode();
                    lastX = textX + textPosition.getWidth();
                }
            }
        }

        boolean inField(TextPosition textPosition)
        {
            float yPos = textPosition.getTextMatrix().getTranslateY();
            float xPos = textPosition.getTextMatrix().getTranslateX();

            return inField(xPos, yPos);
        }

        boolean inField(float xPos, float yPos)
        {
            if (yPos < y - 3 || yPos > y + 3)
                return false;

            if (xPos < x - 1 || xPos > x + width + 1)
                return false;

            return true;
        }
    }

    public HelloSignAnalyzer(PDDocument pdDocument) throws IOException
    {
        super();
        this.pdDocument = pdDocument;
    }

    public Map<String, HelloSignField> analyze() throws IOException
    {
        if (!analyzed)
        {
            fields = new HashMap<>();

            setStartPage(pdDocument.getNumberOfPages());
            getText(pdDocument);

            analyzed = true;
        }
        return Collections.unmodifiableMap(fields);
    }

    public String getLastFormName()
    {
        return lastFormName;
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        // Collect any filled-in text that lies over an already known placeholder.
        for (HelloSignField field : fields.values())
        {
            field.checkForValue(textPositions);
        }

        int position = -1;
        while ((position = text.indexOf('[', position + 1)) >= 0)
        {
            int endPosition = text.indexOf(']', position);
            if (endPosition < 0)
                continue;
            if (endPosition > position + 1 && text.charAt(position + 1) == '$')
            {
                String fieldName = text.substring(position + 2, endPosition);
                int spacePosition = fieldName.indexOf(' ');
                if (spacePosition >= 0)
                    fieldName = fieldName.substring(0, spacePosition);
                HelloSignField field = getOrCreateField(fieldName);

                TextPosition start = textPositions.get(position);
                field.x = start.getTextMatrix().getTranslateX();
                field.y = start.getTextMatrix().getTranslateY();
                TextPosition end = textPositions.get(endPosition);
                field.width = end.getTextMatrix().getTranslateX() + end.getWidth() - field.x;
            }
            else if (endPosition > position + 5 && "def:$".equals(text.substring(position + 1, position + 6)))
            {
                String definition = text.substring(position + 6, endPosition);
                String[] pieces = definition.split("\\|");
                if (pieces.length == 0)
                    continue;
                HelloSignField field = getOrCreateField(pieces[0]);

                if (pieces.length > 1)
                    field.type = pieces[1];
                if (pieces.length > 2)
                    field.optional = !"req".equals(pieces[2]);
                if (pieces.length > 3)
                    field.signer = pieces[3];
                if (pieces.length > 4)
                    field.display = pieces[4];
                if (pieces.length > 5)
                    field.label = pieces[5];
            }
        }

        super.writeString(text, textPositions);
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
    {
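        // Remember the enclosing xobject name; null means we are at page level.
        String currentFormName = formName;
        if (operator != null && "Do".equals(operator.getName()) && operands != null && operands.size() > 0)
        {
            COSBase base0 = operands.get(0);
            if (base0 instanceof COSName)
            {
                formName = ((COSName)base0).getName();
                // Only xobjects drawn directly from the page content count as "last form".
                if (currentFormName == null)
                    lastFormName = formName;
            }
        }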
        try
        {
            super.processOperator(operator, operands);
        }
        finally
        {
            formName = currentFormName;
        }
    }

    //
    // helper methods
    //
    HelloSignField getOrCreateField(String name)
    {
        HelloSignField field = fields.get(name);
        if (field == null)
        {
            field = new HelloSignField();
            field.name = name;
            fields.put(name, field);
        }
        return field;
    }

    //
    // inner member variables
    //
    final PDDocument pdDocument;
    boolean analyzed = false;
    Map<String, HelloSignField> fields = null;
    String formName = null;
    String lastFormName = null;
}

(HelloSignAnalyzer.java)
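The way checkForValue reassembles a field value from individual glyphs can be tried out without PDFBox. The following is a minimal sketch in which the hypothetical Glyph class stands in for TextPosition and carries only the numbers the heuristic uses; the coordinates in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, PDFBox-free model of the word-gap heuristic in checkForValue.
class GapSketch
{
    // Stand-in for TextPosition: text, start x, glyph width, width of a space.
    static class Glyph
    {
        final String text;
        final float x, width, spaceWidth;

        Glyph(String text, float x, float width, float spaceWidth)
        {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    static String join(List<Glyph> glyphs)
    {
        StringBuilder value = new StringBuilder();
        float lastX = 0;
        for (Glyph glyph : glyphs)
        {
            // A gap of more than half a space after the previous glyph is a word break.
            if (glyph.x > lastX + glyph.spaceWidth / 2 && value.length() > 0)
                value.append(' ');
            value.append(glyph.text);
            lastX = glyph.x + glyph.width;
        }
        return value.toString();
    }

    public static void main(String[] args)
    {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("1", 90f, 5f, 3f),
                new Glyph("2", 95f, 5f, 3f),
                new Glyph("3", 100f, 5f, 3f),
                new Glyph("M", 108f, 8f, 3f));  // starts 3 units after "3" ends
        System.out.println(join(glyphs));
    }
}
```

The heuristic is needed because the spaces of a value are not necessarily present as glyphs of their own in the content stream.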

Usage

One can apply the HelloSignAnalyzer to a document as follows:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

Map<String, HelloSignField> fields = helloSignAnalyzer.analyze();

System.out.printf("Found %s fields:\n\n", fields.size());

for (Map.Entry<String, HelloSignField> entry : fields.entrySet())
{
    System.out.printf("%s -> %s\n", entry.getKey(), entry.getValue());
}

System.out.printf("\nLast form name: %s\n", helloSignAnalyzer.getLastFormName());

(PlayWithHelloSign.java test method testAnalyzeInput)

For the OP's sample document, the output is:

Found 8 fields:

var1001 -> [Name: 'var1001'; Value: `123 Main St.` Position: 90.0, 580.0; Width: 165.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'Address', Label: 'address1']
var1004 -> [Name: 'var1004'; Value: `12345` Position: 210.0, 564.0; Width: 45.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'Postal Code', Label: 'zip']
var1002 -> [Name: 'var1002'; Value: `TestCity` Position: 90.0, 564.0; Width: 65.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'City', Label: 'city']
var1003 -> [Name: 'var1003'; Value: `AA` Position: 161.0, 564.0; Width: 45.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'State', Label: 'state']
date2 -> [Name: 'date2'; Value: `2016/12/09` Position: 397.0, 407.0; Width: 124.63202; Type: 'date'; Optional: false; Signer: 'signer2'; Display: 'null', Label: 'null']
signature1 -> [Name: 'signature1'; Value: `` Position: 88.0, 489.0; Width: 236.624; Type: 'sig'; Optional: false; Signer: 'signer1'; Display: 'null', Label: 'null']
date1 -> [Name: 'date1'; Value: `2016/12/09` Position: 397.0, 489.0; Width: 124.63202; Type: 'date'; Optional: false; Signer: 'signer1'; Display: 'null', Label: 'null']
signature2 -> [Name: 'signature2'; Value: `` Position: 88.0, 407.0; Width: 236.624; Type: 'sig'; Optional: false; Signer: 'signer2'; Display: 'null', Label: 'null']

Last form name: Xi0

HelloSignManipulator

This class makes use of the information a HelloSignAnalyzer has gathered to clear the contents of text fields given by their name.

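// Imports use the package names of PDFBox 2.0.x.
// HelloSignField is the inner class HelloSignAnalyzer.HelloSignField and must be imported from wherever HelloSignAnalyzer lives.
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.util.Matrix;

public class HelloSignManipulator extends PDFTextStripper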
{
    public HelloSignManipulator(HelloSignAnalyzer helloSignAnalyzer) throws IOException
    {
        super();
        this.helloSignAnalyzer = helloSignAnalyzer;
        addOperator(new SelectiveDrawObject());
    }

    public void clearFields(Iterable<String> fieldNames) throws IOException
    {
        try
        {
            Map<String, HelloSignField> fieldMap = helloSignAnalyzer.analyze();
            List<HelloSignField> selectedFields = new ArrayList<>();
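            for (String fieldName : fieldNames)
            {
                // Fail early on unknown names; a null entry would cause a NullPointerException later.
                HelloSignField field = fieldMap.get(fieldName);
                if (field == null)
                    throw new IllegalArgumentException("Unknown field name: " + fieldName);
                selectedFields.add(field);
            }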
            fields = selectedFields;

            PDDocument pdDocument = helloSignAnalyzer.pdDocument;
            setStartPage(pdDocument.getNumberOfPages());
            getText(pdDocument);
        }
        finally
        {
            fields = null;
        }
    }

    class SelectiveDrawObject extends OperatorProcessor
    {
        @Override
        public void process(Operator operator, List<COSBase> arguments) throws IOException
        {
            if (arguments.size() < 1)
            {
                throw new MissingOperandException(operator, arguments);
            }
            COSBase base0 = arguments.get(0);
            if (!(base0 instanceof COSName))
            {
                return;
            }
            COSName name = (COSName) base0;

            // Null-safe comparison: getLastFormName() is null if the analyzer saw no xobject.
            if (replacement != null || !name.getName().equals(helloSignAnalyzer.getLastFormName()))
            {
                return;
            }

            if (context.getResources().isImageXObject(name))
            {
                throw new IllegalArgumentException("The form xobject to edit turned out to be an image.");
            }

            PDXObject xobject = context.getResources().getXObject(name);

            if (xobject instanceof PDTransparencyGroup)
            {
                throw new IllegalArgumentException("The form xobject to edit turned out to be a transparency group.");
            }
            else if (xobject instanceof PDFormXObject)
            {
                PDFormXObject form = (PDFormXObject) xobject;
                PDFormXObject formReplacement = new PDFormXObject(helloSignAnalyzer.pdDocument);
                formReplacement.setBBox(form.getBBox());
                formReplacement.setFormType(form.getFormType());
                formReplacement.setMatrix(form.getMatrix().createAffineTransform());
                formReplacement.setResources(form.getResources());
                OutputStream outputStream = formReplacement.getContentStream().createOutputStream(COSName.FLATE_DECODE);
                replacement = new ContentStreamWriter(outputStream);

                context.showForm(form);

                outputStream.close();
                getResources().put(name, formReplacement);
                replacement = null;
            }
        }

        @Override
        public String getName()
        {
            return "Do";
        }
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
    {
        if (replacement != null)
        {
            boolean copy = true;
            if (TjTJ.contains(operator.getName()))
            {
                Matrix transformation = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                float xPos = transformation.getTranslateX();
                float yPos = transformation.getTranslateY();
                for (HelloSignField field : fields)
                {
                    if (field.inField(xPos, yPos))
                    {
                        copy = false;
                    }
                }
            }

            if (copy)
            {
                replacement.writeTokens(operands);
                replacement.writeToken(operator);
            }
        }
        super.processOperator(operator, operands);
    }

    //
    // helper methods
    //
    final HelloSignAnalyzer helloSignAnalyzer;
    final Collection<String> TjTJ = Arrays.asList("Tj", "TJ");
    Iterable<HelloSignField> fields;
    ContentStreamWriter replacement = null;
}

(HelloSignManipulator.java)
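Stripped of the PDFBox plumbing, the processOperator override is a filter: every instruction of the form xobject is copied into the replacement stream unless it is a text-showing operator (Tj or TJ) positioned inside one of the selected fields. The following is a minimal sketch of that idea only; the Instruction class and the predicate are hypothetical stand-ins for PDFBox's operator/operand tokens and for HelloSignField.inField.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, PDFBox-free model of the copy-unless-matched idea.
class FilterSketch
{
    // Simplified instruction: an operator name and the text position in effect.
    static class Instruction
    {
        final String operator;
        final float x, y;

        Instruction(String operator, float x, float y)
        {
            this.operator = operator;
            this.x = x;
            this.y = y;
        }
    }

    // Copies every instruction except text-showing ones inside a selected field.
    static List<Instruction> filter(List<Instruction> input, Predicate<Instruction> inSelectedField)
    {
        List<Instruction> copy = new ArrayList<>();
        for (Instruction instruction : input)
        {
            boolean showsText = "Tj".equals(instruction.operator) || "TJ".equals(instruction.operator);
            if (!(showsText && inSelectedField.test(instruction)))
                copy.add(instruction);
        }
        return copy;
    }

    public static void main(String[] args)
    {
        List<Instruction> stream = Arrays.asList(
                new Instruction("Tj", 90f, 580f),   // address line
                new Instruction("re", 90f, 580f),   // not a text operator, always kept
                new Instruction("Tj", 90f, 564f));  // city line
        // Stand-in for HelloSignField.inField: select the line at y = 580.
        List<Instruction> kept = filter(stream, i -> i.y == 580f);
        for (Instruction instruction : kept)
            System.out.println(instruction.operator + " at " + instruction.x + ", " + instruction.y);
    }
}
```

Because the dropped instructions never reach the replacement stream, the text is actually gone from the form xobject and is not merely hidden, which is what requirement 2 of the question asks for.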

Usage: Clear single field

One can apply the HelloSignManipulator to a document as follows to clear a single field:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Collections.singleton("var1001"));

pdDocument.save(...);

(PlayWithHelloSign.java test method testClearAddress1Input)

Usage: Clear multiple fields at once

One can apply the HelloSignManipulator to a document as follows to clear multiple fields at once:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Arrays.asList("var1004", "var1003", "date2"));

pdDocument.save(...);

(PlayWithHelloSign.java test method testClearZipStateDate2Input)

Usage: Clear multiple fields successively

One can apply the HelloSignManipulator to a document as follows to clear multiple fields successively:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Collections.singleton("var1004"));
helloSignManipulator.clearFields(Collections.singleton("var1003"));
helloSignManipulator.clearFields(Collections.singleton("date2"));

pdDocument.save(...);

(PlayWithHelloSign.java test method testClearZipStateDate2SuccessivelyInput)

Caveat

These classes are mere proofs of concept. First, they are built on a single example HelloSign file, so there is a huge chance of having missed important details. Second, they contain built-in assumptions, e.g. the fixed tolerances (3 units vertically, 1 unit horizontally) in the HelloSignField method inField.
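To see how narrow the inField assumptions are, the test can be copied into a PDFBox-free sketch and probed with a few positions. The class name ToleranceSketch is hypothetical; the field geometry is that of var1001 in the analyzer output above, and the probe positions are made up.

```java
// Hypothetical, PDFBox-free copy of the inField test, to probe its tolerances.
class ToleranceSketch
{
    static boolean inField(float x, float y, float width, float xPos, float yPos)
    {
        // Vertical tolerance: 3 units above or below the placeholder baseline.
        if (yPos < y - 3 || yPos > y + 3)
            return false;
        // Horizontal tolerance: 1 unit beyond either end of the placeholder.
        if (xPos < x - 1 || xPos > x + width + 1)
            return false;
        return true;
    }

    public static void main(String[] args)
    {
        // var1001 from the analyzer output: position 90, 580; width 165.53601
        float x = 90f, y = 580f, width = 165.53601f;
        System.out.println(inField(x, y, width, 90f, 580f));  // text starting on the placeholder
        System.out.println(inField(x, y, width, 90f, 564f));  // the city line, 16 units lower
        System.out.println(inField(x, y, width, 90f, 584f));  // baseline shifted by 4 units
        System.out.println(inField(x, y, width, 80f, 580f));  // text starting 10 units to the left
    }
}
```

Only the first probe is accepted. Note also that only the start point of a piece of text is tested, so a value drawn with a slightly different baseline, or a single text-showing operation that begins left of the placeholder, would presumably be missed.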

Furthermore, manipulating signed HelloSign files in general might be a bad idea. If I understood their concept correctly, they store a hash of each signed document to allow verification of the content, and if the document is manipulated as shown above, the hash value won't match anymore.
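The hash concern is easy to reproduce in isolation. The following is a minimal sketch assuming a SHA-256 digest over the file bytes; which algorithm HelloSign actually uses, and over which data, is not known here, and the byte arrays are made-up stand-ins for the PDF before and after clearing a field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration; SHA-256 is an assumption, not HelloSign's documented scheme.
class HashSketch
{
    static byte[] sha256(byte[] data)
    {
        try
        {
            return MessageDigest.getInstance("SHA-256").digest(data);
        }
        catch (NoSuchAlgorithmException e)
        {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // Stand-ins for the document bytes before and after clearing a field.
        byte[] signed = "Address: 123 Main St. / City: TestCity".getBytes(StandardCharsets.UTF_8);
        byte[] edited = "Address:  / City: TestCity".getBytes(StandardCharsets.UTF_8);

        boolean same = Arrays.equals(sha256(signed), sha256(edited));
        System.out.println("Digests match: " + same);
    }
}
```

Any change to the stored bytes, including the rewritten form xobject produced by the manipulator, yields a different digest, so a verification against the originally recorded value would fail.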



Tags: java pdf pdfbox