Using iTextPDF to trim a page's whitespace

Posted 2019-09-09 00:50

Question:

I have a PDF which consists of some data followed by some whitespace. I don't know how large the data is, but I'd like to trim off the whitespace following the data.

    PdfReader reader = new PdfReader(PDFLOCATION);
    Rectangle rect = new Rectangle(700, 2000);
    Document document = new Document(rect);
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(SAVELOCATION));

    document.open();

    // Import every source page and add it to the new document as an image
    int n = reader.getNumberOfPages();
    PdfImportedPage page;
    for (int i = 1; i <= n; i++) {
        document.newPage();
        page = writer.getImportedPage(reader, i);
        Image instance = Image.getInstance(page);
        document.add(instance);
    }
    document.close();

Is there a way to clip/trim the whitespace for each page in the new document? This PDF contains vector graphics.

I'm using iTextPDF, but can switch to any Java library (mavenized, Apache license preferred).

Answer 1:

As no actual solution has been posted, here are some pointers from the accompanying itext-questions mailing list thread:

  1. As you merely want to trim pages, this is a case for PdfStamper rather than for PdfWriter plus getImportedPage. Your main code using a PdfStamper might look like this:

    PdfReader reader = new PdfReader(resourceStream); 
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf")); 
    
    // Go through all pages 
    int n = reader.getNumberOfPages(); 
    for (int i = 1; i <= n; i++) 
    { 
        Rectangle pageSize = reader.getPageSize(i); 
        Rectangle rect = getOutputPageSize(pageSize, reader, i); 
    
        PdfDictionary page = reader.getPageN(i); 
        page.put(PdfName.CROPBOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()})); 
        stamper.markUsed(page); 
    } 
    stamper.close(); 
    

    As you can see, I also added another argument to your prospective getOutputPageSize method: the page number. The amount of white space to trim might differ from page to page, after all.

  2. If the source document did not contain vector graphics, you could simply use the iText parser package classes. There even already is a TextMarginFinder based on them. In this case the getOutputPageSize method (with the additional page parameter) could look like this:

    private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page) throws IOException 
    { 
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
        Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
        System.out.printf("Text/bitmap boundary: %f,%f to %f, %f\n", finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
        return result;
    }
    

    Applying this method to your file test.pdf shows that the code trims according to the text (and bitmap image) content on the page.

  3. To find the bounding box respecting vector graphics, too, you essentially have to do the same but you have to extend the parser framework used here to inform its listeners (the TextMarginFinder essentially is a listener to drawing events sent from the parser framework) about vector graphics operations, too. This is non-trivial, especially if you don't know PDF syntax by heart yet.

  4. If your PDFs to trim are not too generic but can be forced to include some text or bitmap graphics in relevant positions, you could use the sample code above (probably with minor changes) anyway.

    E.g. if your PDFs always start with text on top and end with text at the bottom, you could change getOutputPageSize to create the result rectangle like this:

    Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(), pageSize.getRight(), finder.getUry());
    

    This only trims the empty space at the top and bottom.

    Depending on your input data pool and requirements this might suffice.

    Or you can use some other heuristics depending on your knowledge of the input data. If you know something about the positioning of text (e.g. the heading is always centered and some other text always starts at the left), you can easily extend the TextMarginFinder to take advantage of this knowledge.
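
A box derived from text alone can sit right against the glyphs, so you may want a little margin before writing the rectangle into the CropBox as in step 1. The following is only a sketch of that idea using plain java.awt.geom, not iText code; the names CropBoxSketch and paddedCropBox and the padding parameter are made up for illustration. It grows the content box, clips it to the page, and falls back to the full page if nothing was found.

```java
import java.awt.geom.Rectangle2D;

final class CropBoxSketch {
    /**
     * Grows the content box by padding on every side and clips it to the page box.
     * Returns {llx, lly, urx, ury}, the order used for the CropBox array in step 1.
     * (Hypothetical helper, not part of iText.)
     */
    static float[] paddedCropBox(Rectangle2D.Float content, Rectangle2D.Float page, float padding) {
        Rectangle2D.Float grown = new Rectangle2D.Float(
                content.x - padding, content.y - padding,
                content.width + 2 * padding, content.height + 2 * padding);
        // Never let the crop box extend beyond the original page.
        Rectangle2D clipped = grown.createIntersection(page);
        if (clipped.isEmpty()) {
            // Empty or off-page content box: keep the page untrimmed.
            clipped = page;
        }
        return new float[] {
                (float) clipped.getMinX(), (float) clipped.getMinY(),
                (float) clipped.getMaxX(), (float) clipped.getMaxY() };
    }
}
```

The returned array could be handed to the PdfArray constructor in step 1 in place of the four rect getters.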


Recent (April 2015, iText 5.5.6-SNAPSHOT) improvements

The current development version, 5.5.6-SNAPSHOT, extends the parser package to also include vector graphics parsing. This allows for an extension of iText's original TextMarginFinder class implementing the new ExtRenderListener methods like this:

    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo)
    {
        // Collect the points this path segment contributes (still untransformed).
        List<Vector> points = new ArrayList<Vector>();
        if (renderInfo.getOperation() == PathConstructionRenderInfo.RECT)
        {
            // A rectangle is given as x, y, width, height: add its four corners.
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            float w = renderInfo.getSegmentData().get(2);
            float h = renderInfo.getSegmentData().get(3);
            points.add(new Vector(x, y, 1));
            points.add(new Vector(x+w, y, 1));
            points.add(new Vector(x, y+h, 1));
            points.add(new Vector(x+w, y+h, 1));
        }
        else if (renderInfo.getSegmentData() != null)
        {
            // Other operations carry x/y pairs (including curve control points).
            for (int i = 0; i < renderInfo.getSegmentData().size()-1; i+=2)
            {
                points.add(new Vector(renderInfo.getSegmentData().get(i), renderInfo.getSegmentData().get(i+1), 1));
            }
        }

        // Apply the current transformation matrix and grow the path's bounding box.
        for (Vector point: points)
        {
            point = point.cross(renderInfo.getCtm());
            Rectangle2D.Float pointRectangle = new Rectangle2D.Float(point.get(Vector.I1), point.get(Vector.I2), 0, 0);
            if (currentPathRectangle == null)
                currentPathRectangle = pointRectangle;
            else
                currentPathRectangle.add(pointRectangle);
        }
    }

    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo)
    {
        // Only painted paths count; the null check guards against painting an empty path.
        if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP && currentPathRectangle != null)
        {
            if (textRectangle == null)
                textRectangle = currentPathRectangle;
            else
                textRectangle.add(currentPathRectangle);
        }
        currentPathRectangle = null;

        return null;
    }

    @Override
    public void clipPath(int rule)
    {
        // Clipping is ignored in this proof of concept.
    }

(Full source: MarginFinder.java)
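
The core of modifyPath, mapping each path point through the current transformation matrix and growing a rectangle around the results, can be tried out without iText. The sketch below is an illustration only: it uses java.awt.geom.AffineTransform as a stand-in for the CTM, and PathBoundsSketch and boundsOf are hypothetical names, not part of MarginFinder.java.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

final class PathBoundsSketch {
    /**
     * Returns the bounding box of the points (x0, y0, x1, y1, ...) after
     * applying ctm, or null if coords contains no complete point.
     */
    static Rectangle2D.Float boundsOf(float[] coords, AffineTransform ctm) {
        Rectangle2D.Float bounds = null;
        for (int i = 0; i + 1 < coords.length; i += 2) {
            // Transform from path coordinates into page coordinates.
            Point2D p = ctm.transform(new Point2D.Float(coords[i], coords[i + 1]), null);
            if (bounds == null) {
                // Start with a zero-size rectangle at the first point.
                bounds = new Rectangle2D.Float((float) p.getX(), (float) p.getY(), 0, 0);
            } else {
                // Rectangle2D.add enlarges the rectangle to contain the point.
                bounds.add(p);
            }
        }
        return bounds;
    }
}
```

Transforming every point before taking minima and maxima matters because a rotating or skewing matrix can turn the lower left corner in path coordinates into some other corner on the page.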

Using this class to trim the white space gives a result that is pretty much what one would hope for.

Beware: The implementation above is far from optimal. It is not even correct, as it includes all curve control points, which overestimates the bounding box. Furthermore, it ignores things like line width or wedge types. It is merely a proof of concept.
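
To illustrate the control point issue: a cubic Bezier curve lies within the hull of its four points but generally does not reach the two inner control points. A tighter box comes from the start and end points plus the points where the derivative of the curve vanishes. The following is a minimal sketch of that calculation for one coordinate axis, assuming plain doubles rather than iText types; BezierBoundsSketch, at and range are hypothetical names, and line width is still ignored.

```java
final class BezierBoundsSketch {
    /** Value of a cubic Bezier at parameter t, for one coordinate axis. */
    static double at(double p0, double p1, double p2, double p3, double t) {
        double u = 1 - t;
        return u * u * u * p0 + 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t * p3;
    }

    /** Returns {min, max} of one coordinate of the curve over t in [0, 1]. */
    static double[] range(double p0, double p1, double p2, double p3) {
        // The end points always lie on the curve.
        double min = Math.min(p0, p3);
        double max = Math.max(p0, p3);

        // Derivative divided by 3 is a*t^2 + b*t + c.
        double a = p3 - 3 * p2 + 3 * p1 - p0;
        double b = 2 * (p2 - 2 * p1 + p0);
        double c = p1 - p0;

        double[] roots;
        if (Math.abs(a) < 1e-12) {
            // Degenerates to a linear (or constant) equation.
            roots = Math.abs(b) < 1e-12 ? new double[0] : new double[] { -c / b };
        } else {
            double disc = b * b - 4 * a * c;
            if (disc < 0) {
                roots = new double[0];
            } else {
                double s = Math.sqrt(disc);
                roots = new double[] { (-b + s) / (2 * a), (-b - s) / (2 * a) };
            }
        }

        // Only extrema strictly inside the curve's parameter range matter.
        for (double t : roots) {
            if (t > 0 && t < 1) {
                double v = at(p0, p1, p2, p3, t);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        return new double[] { min, max };
    }
}
```

For a curve from (0, 0) to (100, 0) with control points (0, 100) and (100, 100), calling range once with the x values and once with the y values gives a box that is 75 units high, whereas the control point approach reports 100.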

All test code is in TestTrimPdfPage.java.