My Question
I'm looking for a way to convert the individual PDF pages into a byte[] (as in one byte[] per PDF page) so that I can then convert them to a BufferedImage[].
This way, all the conversion is done in memory instead of through temporary files, which should be faster and less messy. I may use the byte arrays for service calls later on as well. It would be nice if I could keep the library use to only iText; however, if there isn't any other way, I'm open to other libraries.
What I have now
This is the code that I currently have:
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;
import com.itextpdf.text.pdf.PdfReader;

public static BufferedImage toBufferedImage(byte[] input) throws IOException {
    InputStream in = new ByteArrayInputStream(input);
    BufferedImage bimg = ImageIO.read(in);
    return bimg;
}

public static BufferedImage[] extract(final String fileName) throws IOException {
    PdfReader reader = new PdfReader(fileName);
    int pageNum = reader.getNumberOfPages();
    BufferedImage[] imgArray = new BufferedImage[pageNum];
    for (int page = 0; page < pageNum; page++) {
        //TODO: You may need to decode the bytearray first?
        // PdfReader page numbers are 1-based, so array slot `page` holds PDF page `page + 1`
        imgArray[page] = toBufferedImage(reader.getPageContent(page + 1));
    }
    reader.close();
    return imgArray;
}

public static void convert() throws IOException {
    String fileName = getProps("file_in");
    BufferedImage[] bim = extract(fileName);
    // No streams to close here; extract() closes the PdfReader itself
}
And here's a (non-representative) list of the links that I've checked out so far.
I did some digging and came up with a solution! Hopefully someone else finds this when they need it, and that it helps as much as possible. Cheers!
Extending the RenderListener Class
I looked around and found this. Looking through the code and classes, I found that PdfImageObjects have a getBufferedImage() method, which is exactly what I was looking for. Now there's no need to convert to a byte[], which is what I originally thought I was going to have to do. Using the given example code, I came up with this class:
Important changes to notice here compared to the link above are the additions of a new field ArrayList<BufferedImage> bimg, a getter for that field, and a restructuring of the renderImage() function.
I also changed some of the methods in the other class of my project:
Code for Bursting PDF to BufferedImage[]
Code for Converting BufferedImage[] to Multi-Page Tiff
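A minimal sketch of this step, not the answer's original code: it assumes Java 9 or later, where ImageIO bundles a TIFF writer, and the names MultiPageTiff, toMultiPageTiff and countPages are hypothetical. It writes each BufferedImage as one page of an in-memory TIFF and can read the page count back as a sanity check.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

public class MultiPageTiff {

    // Hypothetical helper: encodes every image as one page of a single TIFF, entirely in memory.
    public static byte[] toMultiPageTiff(BufferedImage[] pages) throws IOException {
        // Assumes a TIFF writer is registered (bundled with the JDK since Java 9).
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.prepareWriteSequence(null);   // null = default stream metadata
            for (BufferedImage page : pages) {
                // One writeToSequence call per image appends one TIFF page.
                writer.writeToSequence(new IIOImage(page, null, null), null);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        // The stream is closed (and flushed into `bytes`) before the array is taken.
        return bytes.toByteArray();
    }

    // Hypothetical helper: reads the TIFF back and reports how many pages it contains.
    public static int countPages(byte[] tiff) throws IOException {
        ImageReader reader = ImageIO.getImageReadersByFormatName("tiff").next();
        try (ImageInputStream in = new MemoryCacheImageInputStream(new ByteArrayInputStream(tiff))) {
            reader.setInput(in);
            return reader.getNumImages(true);    // true = allow scanning the whole stream
        } finally {
            reader.dispose();
        }
    }
}
```

The returned byte[] can be written to a file or passed on to a service call, which keeps the whole conversion in memory as the question asked.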
To kick off the whole process:
IMPORTANT TO NOTE:
You may notice that listener.renderImage() is never explicitly called in my code. It seems that renderImage() is a helper function that is called somewhere else when the listener object is passed into the parser object. This happens in the getBufImgArr(param) method.
As @mkl in the comments below has noted, the code extracts all images in the PDF page, since a PDF page isn't an image in and of itself. Problems may occur if you're running this code on PDFs that were scanned and run through OCR, or PDFs that have multiple layers. In this scenario, you'd have multiple images from a single PDF page being converted into multiple TIFF images, when you (may) want them to stay together on a single page.
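To make both notes concrete, here is a small library-free sketch of the callback mechanism. Every name in it (CallbackSketch, ImageListener, CollectingListener, processPage) is a hypothetical stand-in rather than iText's API, and strings stand in for images. It only illustrates that the parser, not your code, invokes renderImage(), and that a page holding two images yields two results.

```java
import java.util.ArrayList;
import java.util.List;

public class CallbackSketch {

    // Hypothetical stand-in for a render listener: the parser calls this, user code never does.
    interface ImageListener {
        void renderImage(String imageName);
    }

    // Collects whatever it is handed, similar in spirit to the bimg list described above.
    static class CollectingListener implements ImageListener {
        private final List<String> images = new ArrayList<>();

        @Override
        public void renderImage(String imageName) {
            images.add(imageName);
        }

        List<String> getImages() {
            return images;
        }
    }

    // Hypothetical stand-in for the parser: walks one page's images and notifies the listener.
    static <L extends ImageListener> L processPage(List<String> imagesOnPage, L listener) {
        for (String image : imagesOnPage) {
            listener.renderImage(image);   // the only place renderImage() is invoked
        }
        return listener;
    }
}
```

Passing a page with a scanned image and a separate overlay image through processPage leaves two entries in the listener, which is the multiple-images-per-page situation described above.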
Good sources I found:
Programcreek search for PdfReaderContentParser