Convert PDF pages to byte array using iText

Posted 2019-07-20 13:27

My Question

I'm looking for a way to convert the individual PDF pages into byte arrays (one byte[] per page) so that I can then convert them to a BufferedImage[].

This way, all the conversion is done in memory instead of through temporary files, which is faster and less messy. I may also use the byte arrays for service calls later on. I'd prefer to use only iText, but if there is no other way, I'm open to other libraries.

What I have now

This is the code that I currently have:

public static BufferedImage toBufferedImage(byte[] input) throws IOException {
    InputStream in = new ByteArrayInputStream(input);
    BufferedImage bimg = ImageIO.read(in);
    return bimg;
}

public static BufferedImage[] extract(final String fileName) throws IOException {
    PdfReader reader = new PdfReader(fileName);
    try {
        int pageCount = reader.getNumberOfPages();
        BufferedImage[] imgArray = new BufferedImage[pageCount];

        // iText page numbers are 1-based; the array index is 0-based.
        for (int page = 1; page <= pageCount; page++) {
            // TODO: You may need to decode the byte array first?
            // Note: getPageContent() returns the page's content stream, not encoded image data.
            imgArray[page - 1] = toBufferedImage(reader.getPageContent(page));
        }
        return imgArray;
    } finally {
        reader.close();
    }
}

public static void convert() throws IOException {
    String fileName = getProps("file_in");
    BufferedImage[] bim = extract(fileName);
    // Nothing to close here: extract() closes the PdfReader itself.
}

And here's a (non-representative) list of the links that I've checked out so far.

Useful, but not quite what I want

Uses a different library

1 answer
走好不送
#2 · 2019-07-20 13:58

I did some digging and came up with a solution! Hopefully someone else finds this when they need it, and that it helps as much as possible. Cheers!

Implementing the RenderListener Interface

I looked around and found this. Looking through the code and classes, I found that PdfImageObject has a getBufferedImage() method, which is exactly what I was looking for. There's no need to convert to a byte[] first, which is what I originally thought I would have to do. Using the given example code, I came up with this class:

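import java.awt.image.BufferedImage;
import java.io.IOException;
import java.util.ArrayList;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class MyImageRenderListener implements RenderListener {

    /** Output path pattern kept from the original example; only stored, never written to. */
    protected String path = "";
    /** Images collected so far, in the order the parser reports them. */
    protected ArrayList<BufferedImage> bimg = new ArrayList<>();

    /**
     * Creates a RenderListener that will look for images.
     */
    public MyImageRenderListener(String path) {
        this.path = path;
    }

    public ArrayList<BufferedImage> getBimgArray() {
        return bimg;
    }

    // Text callbacks are required by the interface but are not needed here.
    public void beginTextBlock() { }
    public void endTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }

    /**
     * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(
     * com.itextpdf.text.pdf.parser.ImageRenderInfo)
     */
    public void renderImage(ImageRenderInfo renderInfo) {
        try {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) {
                return;
            }
            bimg.add(image.getBufferedImage());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}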

Important changes to notice here compared to the link above are the additions of a new field ArrayList<BufferedImage> bimg, a getter for that field, and a restructuring of the renderImage() function.

I also changed some of the methods in the other class of my project:

Code for Bursting PDF to BufferedImage[]

// Credit to Mihai. Code found here: http://stackoverflow.com/questions/6851385/save-tiff-ccittfaxdecode-from-pdf-page-using-itext-and-java
public static ArrayList<BufferedImage> getBufImgArr(final String basePath) throws IOException {

    PdfReader reader = new PdfReader(basePath);
    try {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // The path pattern is only stored by the listener; nothing is written to disk.
        MyImageRenderListener listener = new MyImageRenderListener(basePath + "extract/image%s.%s");

        // iText page numbers are 1-based.
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            // processContent() calls listener.renderImage() for each image on the page.
            parser.processContent(page, listener);
        }
        return listener.getBimgArray();
    } finally {
        reader.close();
    }
}
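If a byte[] per image is still needed (for the service calls mentioned in the question, say), each BufferedImage can be encoded in memory with the JDK alone. The following is a minimal sketch rather than part of the original solution: the class name InMemoryImageBytes and the choice of PNG are assumptions made for illustration. It also shows why feeding arbitrary bytes, such as a page content stream, to ImageIO.read() gives null instead of an image.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.imageio.ImageIO;

// Hypothetical helper class, not part of iText or the original answer.
final class InMemoryImageBytes {

    static {
        // By default ImageIO may cache stream data in a temporary file;
        // turning the cache off keeps the conversion in memory.
        ImageIO.setUseCache(false);
    }

    /** Encodes an image as PNG bytes. */
    static byte[] toPngBytes(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }

    /** Decodes encoded image bytes; returns null if no ImageIO reader recognizes them. */
    static BufferedImage toBufferedImage(byte[] input) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(input));
    }

    public static void main(String[] args) throws IOException {
        BufferedImage original = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        original.setRGB(1, 1, 0xFF0000);

        // Round trip: image -> byte[] -> image.
        BufferedImage decoded = toBufferedImage(toPngBytes(original));
        System.out.println(decoded.getWidth() + "x" + decoded.getHeight());

        // PDF drawing operators are not an encoded image, so no reader matches.
        byte[] notAnImage = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toBufferedImage(notAnImage) == null);
    }
}
```

PNG is used here only because it is lossless and always available in the JDK; any format with a registered ImageIO writer would work the same way.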

Code for Converting BufferedImage[] to Multi-Page TIFF

public static void convert(String fin) throws IOException {

    ArrayList<BufferedImage> bimgArrL = getBufImgArr(fin);
    BufferedImage[] bim = bimgArrL.toArray(new BufferedImage[0]);

    try (RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
        new FileOutputStream("/path/you/want/result/to/go.tiff"))) {

        // The options for the TIFF file are set here.
        // **THIS BLOCK USES THE ICAFE LIBRARY TO CONVERT TO MULTIPAGE-TIFF**
        // ICAFE: https://github.com/dragon66/icafe
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setApplyPredictor(true);
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        tiffOptions.setDeflateCompressionLevel(0);
        builder.imageOptions(tiffOptions);
        // NOTE: as written, builder and tiffOptions are never passed to
        // writeMultipageTIFF(), so these settings do not affect the output.
        // Check the ICAFE wiki for how to pass ImageParam values to the writer.
        TIFFTweaker.writeMultipageTIFF(rout, bim);
        // I found this block of code here: https://github.com/dragon66/icafe/wiki
        // About 3/4 of the way down the page

    }
}

To kick off the whole process:

public static void main(String[] args) throws IOException {
    convert("/path/to/pdf/image.pdf");
}
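As an alternative to the ICAFE step, and not what the answer itself uses, a multi-page TIFF can probably also be written with the TIFF plugin that ships with javax.imageio in Java 9 and later. The sketch below assumes such a JDK; the class name MultiPageTiffSketch and both method names are made up for illustration. It writes to a byte[] so that everything stays in memory, and it uses the writer's default settings (no CCITT compression).

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper class; assumes the JDK's built-in TIFF plugin (Java 9+).
final class MultiPageTiffSketch {

    /** Writes each image as one TIFF page and returns the encoded file as bytes. */
    static byte[] writeMultiPageTiff(List<BufferedImage> pages) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF writer available (Java 9+ is assumed)");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                writer.writeToSequence(new IIOImage(page, null, null),
                        writer.getDefaultWriteParam());
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        // The stream is closed (and flushed) before the bytes are read out.
        return bytes.toByteArray();
    }

    /** Counts the pages in an encoded TIFF, as a quick check of the result. */
    static int countPages(byte[] tiff) throws IOException {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("tiff");
        if (!readers.hasNext()) {
            throw new IOException("No TIFF reader available (Java 9+ is assumed)");
        }
        ImageReader reader = readers.next();
        try (ImageInputStream in =
                 new MemoryCacheImageInputStream(new ByteArrayInputStream(tiff))) {
            reader.setInput(in);
            return reader.getNumImages(true);
        } finally {
            reader.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        List<BufferedImage> pages = Arrays.asList(
                new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB),
                new BufferedImage(16, 4, BufferedImage.TYPE_BYTE_GRAY));
        byte[] tiff = writeMultiPageTiff(pages);
        System.out.println(countPages(tiff) + " pages, " + tiff.length + " bytes");
    }
}
```

In the flow above, the list returned by getBufImgArr() could be passed to writeMultiPageTiff() directly, and the resulting bytes written to a file or sent on to a service.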

IMPORTANT TO NOTE:

You may notice that listener.renderImage() is never explicitly called in my code. That is because renderImage() is a callback: the listener object is passed to parser.processContent() in the getBufImgArr(param) method, and the parser calls renderImage() itself whenever it encounters an image on the page.
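The callback mechanism can be illustrated without iText at all. In this toy sketch every name (CallbackSketch, ImageListener, ToyParser, CollectingListener) is a hypothetical stand-in, and images are represented by plain strings; it only mirrors the shape of the real parser and listener.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the parser/listener relationship; none of these types are iText classes.
final class CallbackSketch {

    /** Stand-in for RenderListener: one callback per image found. */
    interface ImageListener {
        void renderImage(String imageName);
    }

    /** Stand-in for PdfReaderContentParser. */
    static final class ToyParser {
        private final List<List<String>> pages;

        ToyParser(List<List<String>> pages) {
            this.pages = pages;
        }

        int getNumberOfPages() {
            return pages.size();
        }

        /** Walks one page (1-based) and calls the listener once per image on it. */
        void processContent(int pageNumber, ImageListener listener) {
            for (String image : pages.get(pageNumber - 1)) {
                listener.renderImage(image);
            }
        }
    }

    /** Collects whatever the parser reports, as MyImageRenderListener does. */
    static final class CollectingListener implements ImageListener {
        final List<String> found = new ArrayList<>();

        @Override
        public void renderImage(String imageName) {
            found.add(imageName);
        }
    }

    /** Same loop shape as getBufImgArr(): the caller never invokes renderImage(). */
    static List<String> collectAll(List<List<String>> pages) {
        ToyParser parser = new ToyParser(pages);
        CollectingListener listener = new CollectingListener();
        for (int page = 1; page <= parser.getNumberOfPages(); page++) {
            parser.processContent(page, listener);
        }
        return listener.found;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("scan-a", "scan-b"),   // page 1 holds two images
                Collections.<String>emptyList(),     // page 2 holds none
                Arrays.asList("logo"));              // page 3 holds one
        System.out.println(collectAll(pages));       // prints [scan-a, scan-b, logo]
    }
}
```

Three pages produce three entries here only by coincidence: the result has one entry per image, not one per page.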

As @mkl noted in the comments below, this code extracts all the images embedded in each PDF page, since a PDF page isn't an image in and of itself. Problems may occur if you run it on PDFs that were scanned and processed with OCR, or on PDFs that have multiple layers. In those cases a single PDF page can contain several images, each of which becomes its own TIFF page, when you may want them to stay together on a single page.

Good sources I found:

Programcreek search for PdfReaderContentParser
