PDFBox delete comment maintain strikethrough

Posted 2019-06-11 14:54

Question:

I have a PDF which has a comment on a paragraph. This paragraph is struck through. My requirement is to delete the comment from a specific page.

The following code should delete a specific comment from my PDF but it does not.

PDDocument document = PDDocument.load(...File...);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    List<PDAnnotation> annotationToRemove = new ArrayList<PDAnnotation>();

    if (annotations.size() < 1)
        continue;
    else {
        for (PDAnnotation annotation : annotations) {

            if (annotation.getContents() != null && annotation.getContents().equals("Sample Strikethrough")) {
                annotationToRemove.add(annotation);
            }
        }
        annotations.removeAll(annotationToRemove);
    }
}

What is the best way to remove a specific comment and maintain the strikethrough on the text that the comment was applied to?

Answer 1:

What is the best way to remove a specific comment and maintain the strikethrough on the text that the comment was applied to?

The annotation you found is actually a text markup annotation of subtype StrikeOut, i.e. the main appearance of this annotation is the strikethrough itself. Thus, you must not remove this annotation. Instead, you should remove the data from which the additional appearance of the annotation, the hover text, is generated.

This can be done like this:

final COSName POPUP = COSName.getPDFName("Popup");

PDDocument document = PDDocument.load(resource);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

List<COSObjectable> objectsToRemove = new ArrayList<>();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    for (PDAnnotation annotation : annotations) {
        if ("StrikeOut".equals(annotation.getSubtype()))
        {
            COSDictionary annotationDict = annotation.getCOSObject();
            COSBase popup = annotationDict.getItem(POPUP);
            annotationDict.removeItem(POPUP);            // popup annotation
            annotationDict.removeItem(COSName.CONTENTS); // plain text comment
            annotationDict.removeItem(COSName.RC);       // rich text comment
            annotationDict.removeItem(COSName.T);        // author

            if (popup != null)
                objectsToRemove.add(popup);
        }
    }

    annotations.removeAll(objectsToRemove);
}

(RemoveStrikeoutComment.java test testRemoveLikeStephanImproved)


While I was looking into this, a PDFBox bug became apparent: the OP's original code should have removed the StrikeOut annotation completely, but it did nothing. The cause is a bug in how the COSArrayList class is used for page annotations.

The page annotation list returned by page.getAnnotations() is an instance of COSArrayList. This class carries both a list of COS objects as they appear in the page Annots array and a list of wrappers for those entries (after resolving indirect references where necessary).

The removeAll method (sensibly) checks its argument collection for such wrappers. From the list of COS objects it removes the underlying COS objects rather than the wrappers; from the list of wrappers it removes the argument collection as is (i.e. the wrappers themselves).

This works well for direct objects in the Annots array. Entries of the COS object list that are indirect references, however, are not removed properly: the code tries to remove the resolved annotation dictionaries, while that list actually contains the indirect references.

In the case at hand, the result is that the removals are not written back. In more general situations the results can be even weirder, because the two lists now have different sizes and index-oriented methods can therefore manipulate non-corresponding entries of the two lists.

(Incidentally, the code at the top of this answer removes an indirect reference, not a wrapper. This leaves the lists out of sync, too, since this time an entry is removed only from the COS object list and not from the wrapper list; this case should probably also be handled more robustly.)

A similar problem occurs in the retainAll method.
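
The removeAll mismatch described above can be reproduced without PDFBox in a small stand-alone model. The following is only a sketch under stated assumptions: ParallelListModel, Dict, Ref, Wrapper, removeAllNaive and removeAllByIndex are hypothetical names invented for illustration, and the code is not PDFBox's actual implementation. It keeps one list of raw entries (direct dictionaries or indirect references) and one list of wrappers, then compares a removal that looks for resolved dictionaries with one that removes the same index from both lists.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the two-list design described above.
// This is NOT PDFBox code; it only mimics the raw-array / wrapper-list split.
class ParallelListModel {

    /** Stands in for an annotation dictionary. */
    static final class Dict {
        final String name;
        Dict(String name) { this.name = name; }
    }

    /** Stands in for an indirect reference that points at a dictionary. */
    static final class Ref {
        final Dict target;
        Ref(Dict target) { this.target = target; }
    }

    /** Stands in for an annotation wrapper around a resolved dictionary. */
    static final class Wrapper {
        final Dict dict;
        Wrapper(Dict dict) { this.dict = dict; }
    }

    final List<Object> raw = new ArrayList<>();       // Dict or Ref entries, like the Annots array
    final List<Wrapper> wrappers = new ArrayList<>(); // resolved view handed to callers

    void add(Dict dict, boolean indirect) {
        raw.add(indirect ? new Ref(dict) : dict);
        wrappers.add(new Wrapper(dict));
    }

    /** Looks for resolved dictionaries in the raw list, so Ref entries survive. */
    void removeAllNaive(List<Wrapper> toRemove) {
        List<Object> dicts = new ArrayList<>();
        for (Wrapper w : toRemove) {
            dicts.add(w.dict);
        }
        raw.removeAll(dicts);
        wrappers.removeAll(toRemove);
    }

    /** Removes the same position from both lists, keeping them in step. */
    void removeAllByIndex(List<Wrapper> toRemove) {
        for (int i = wrappers.size() - 1; i >= 0; i--) {
            if (toRemove.contains(wrappers.get(i))) {
                wrappers.remove(i);
                raw.remove(i);
            }
        }
    }

    public static void main(String[] args) {
        ParallelListModel naive = new ParallelListModel();
        naive.add(new Dict("direct"), false);
        naive.add(new Dict("indirect"), true);
        naive.removeAllNaive(new ArrayList<>(naive.wrappers));
        // prints "naive: raw=1, wrappers=0" (the indirect entry is left behind)
        System.out.println("naive: raw=" + naive.raw.size() + ", wrappers=" + naive.wrappers.size());

        ParallelListModel byIndex = new ParallelListModel();
        byIndex.add(new Dict("direct"), false);
        byIndex.add(new Dict("indirect"), true);
        byIndex.removeAllByIndex(new ArrayList<>(byIndex.wrappers));
        // prints "byIndex: raw=0, wrappers=0"
        System.out.println("byIndex: raw=" + byIndex.raw.size() + ", wrappers=" + byIndex.wrappers.size());
    }
}
```

In this model the naive variant leaves the indirect entry in the raw list, which corresponds to a removal that is never written back. The index-based variant is merely one way to keep the two lists aligned in the model; it is not necessarily how PDFBox handles it.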

Another glitch: COSArrayList.lastIndexOf delegates to indexOf of the contained list, so it returns the first matching index instead of the last one.
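
To make the effect of that delegation concrete, here is a minimal sketch using a plain wrapper around java.util.ArrayList. DelegatingListModel and its two methods are hypothetical names chosen for illustration; the class only imitates the pattern described above and is not taken from the PDFBox source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a list class that delegates to a contained list.
class DelegatingListModel {
    private final List<String> actual = new ArrayList<>();

    void add(String value) {
        actual.add(value);
    }

    /** Imitates the glitch: delegates to indexOf, so the FIRST match is returned. */
    int lastIndexOfBuggy(String value) {
        return actual.indexOf(value);
    }

    /** Delegates to lastIndexOf of the contained list, as one would expect. */
    int lastIndexOfCorrect(String value) {
        return actual.lastIndexOf(value);
    }

    public static void main(String[] args) {
        DelegatingListModel list = new DelegatingListModel();
        list.add("a");
        list.add("b");
        list.add("a");
        System.out.println(list.lastIndexOfBuggy("a"));   // prints 0
        System.out.println(list.lastIndexOfCorrect("a")); // prints 2
    }
}
```

The difference only shows up when the list contains the same element more than once; for lists without duplicates both variants return the same index.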

I analysed this against the PDFBox source of the current 3.0.0-SNAPSHOT, but the error also occurs with all versions 2.0.0 - 2.0.7, so their code very likely contains these errors, too.



Tags: java pdf pdfbox