Merging Tagged PDF without ruining the tags

Posted 2019-03-01 15:40

Question:

I am trying to merge two tagged PDFs with the iTextPDF 5.4.4 jar. All operations succeed until the document is closed: the line document.close(); throws the error below.

java.lang.NullPointerException
PDF Creation Failed java.lang.NullPointerException
[B@1d5c1d5c
at com.itextpdf.text.pdf.PdfCopy.fixTaggedStructure(PdfCopy.java:878)
at com.itextpdf.text.pdf.PdfCopy.flushTaggedObjects(PdfCopy.java:799)
at com.itextpdf.text.pdf.PdfDocument.close(PdfDocument.java:836)
at com.itextpdf.text.Document.close(Document.java:416)
at PDFMerger.mergePDF(PDFMerger.java:189)

Please let me know what could be the cause of this issue.

Below is the code I use.

PdfReader reader = new PdfReader(pdf);

boolean setTagged=reader.isTagged() ; 

Document document = new Document();

PdfCopy copy = new PdfCopy(document, new FileOutputStream("Merged.pdf"));

copy.setTagged();

document.open();

int n;
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {

    copy.addPage(copy.getImportedPage(reader, ++page,true));

}
copy.freeReader(reader);
document.close();
reader.close();

Answer 1:

This looks like a bug in the current iText versions.

@Bruno maybe someone should look into this

PdfCopy has a method fixTaggedStructure that tries to repair the tagged structure, which gets somewhat garbled when tagged pages are copied. Up to and including the current iText 5.4.6-SNAPSHOT, it contains the following code:

PdfDictionary dict = (PdfDictionary)iobj.object;
PdfIndirectReference pg = (PdfIndirectReference)dict.get(PdfName.PG);
//if pg is real page - do nothing, else set correct pg and remove first MCID if exists
if (!pageReferences.contains(pg) && !pg.equals(currPage)){
    dict.put(PdfName.PG, currPage);
    PdfArray kids = dict.getAsArray(PdfName.K);
    if (kids != null) {
        PdfObject firstKid = kids.getDirectObject(0);
        if (firstKid.isNumber()) kids.remove(0);
    }
}

Here dict is a StructElem dictionary referenced from some array. By calling pg.equals(currPage), the code implicitly assumes that dict has an entry for the key PdfName.PG. Unfortunately, that entry is optional; the sample document provided by the OP, for example, contains StructElem dictionaries without a Pg entry that are referenced from an array. For those, pg is null, which causes the NPE in question.
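The failure mode can be reproduced without iText. The following is only a minimal sketch in plain Java: the class PgNullDemo, the helpers originalCheck and throwsNpe, and the strings standing in for page references are hypothetical and merely mirror the shape of the condition above. Since the list does not contain null, the left operand of && is true, so the right operand is evaluated and dereferences the null pg.

```java
import java.util.Arrays;
import java.util.List;

public class PgNullDemo {
    // Mirrors the shape of the condition in fixTaggedStructure. Strings stand in
    // for PdfIndirectReference objects (an assumption made for illustration).
    public static boolean originalCheck(List<String> pageReferences, String pg, String currPage) {
        return !pageReferences.contains(pg) && !pg.equals(currPage);
    }

    // Returns true if evaluating the original condition throws a NullPointerException.
    public static boolean throwsNpe(List<String> pageReferences, String pg, String currPage) {
        try {
            originalCheck(pageReferences, pg, currPage);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> pageReferences = Arrays.asList("page1", "page2");
        // A StructElem with a Pg entry: the condition is evaluated without error.
        System.out.println("with Pg entry, NPE: " + throwsNpe(pageReferences, "page1", "page1"));
        // A StructElem without a Pg entry: pg is null and pg.equals(...) fails.
        System.out.println("without Pg entry, NPE: " + throwsNpe(pageReferences, null, "page1"));
    }
}
```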

In this case it suffices to change the order in the equals call, i.e. instead of

if (!pageReferences.contains(pg) && !pg.equals(currPage)){

one should use

if (!pageReferences.contains(pg) && !currPage.equals(pg)){

or

if (pg != null && !pageReferences.contains(pg) && !pg.equals(currPage)){

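depending on the intended program logic: the first variant applies the fix-up to dictionaries without a Pg entry as well, while the second leaves them untouched.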
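A plain-Java sketch, independent of iText, shows how the two conditions evaluate for a few values of pg. The names PgVariantDemo, variantA, and variantB are hypothetical, strings again stand in for page references, and variantA assumes that currPage itself is never null.

```java
import java.util.Arrays;
import java.util.List;

public class PgVariantDemo {
    // Swapped equals call: a missing Pg (null) passes the check, so the caller
    // would go on to set Pg to currPage. Assumes currPage is non-null.
    public static boolean variantA(List<String> pageReferences, String pg, String currPage) {
        return !pageReferences.contains(pg) && !currPage.equals(pg);
    }

    // Explicit null guard: a missing Pg never passes the check and stays untouched.
    public static boolean variantB(List<String> pageReferences, String pg, String currPage) {
        return pg != null && !pageReferences.contains(pg) && !pg.equals(currPage);
    }

    public static void main(String[] args) {
        List<String> pageReferences = Arrays.asList("page1", "page2");
        String currPage = "page3";
        // null = no Pg entry, "page1" = real page, "page3" = current page, "stale" = unknown reference
        String[] candidates = {null, "page1", "page3", "stale"};
        for (String pg : candidates) {
            System.out.println(pg + ": A=" + variantA(pageReferences, pg, currPage)
                    + ", B=" + variantB(pageReferences, pg, currPage));
        }
    }
}
```

Neither condition throws, and the two agree for every non-null pg; only the null case separates them.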

@Bruno Please check which variant is semantically correct; I'm not really into this tagged structure stuff after all...



Answer 2:

The code below is written in C#:

  public static byte[] mergeTest(byte[] pdf) {
        PdfReader reader = null;
        Document doc = null;
        PdfCopy copy = null;
        MemoryStream stream = new MemoryStream();
        byte[] output = null;

        try {
            reader = new PdfReader(pdf);
            doc = new Document();

            copy = new PdfCopy(doc, stream);
            bool tagged = reader.IsTagged();

            if (tagged)
                copy.SetTagged();


            doc.Open();

            for (int x = 1; x <= reader.NumberOfPages; x++) {
                copy.AddPage(copy.GetImportedPage(reader, x, tagged));
            }

            copy.FreeReader(reader);
            doc.Close();
            copy.Close();

            output = stream.ToArray();

            stream.Flush();
            stream.Dispose();

        } catch (Exception) {
            // Any failure is swallowed here; the method then returns null.
        } finally {
            try {
                if (reader != null)
                    reader.Close();
            } catch (Exception) { }
        }
        return output;
    }