How to extract images from a PDF with iText in the

2019-01-17 17:43发布

问题:

I am trying to extract images from a PDF file. I found an example on the web, that worked fine:

    PdfReader reader;

    File file = new File("example.pdf");
    reader = new PdfReader(file.getAbsolutePath());
    for (int i = 0; i < reader.getXrefSize(); i++) {
        PdfObject pdfobj = reader.getPdfObject(i);
        if (pdfobj == null || !pdfobj.isStream()) {
            continue;
        }
        PdfStream stream = (PdfStream) pdfobj;
        PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
        if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
            byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
            FileOutputStream out = new FileOutputStream(new File(file.getParentFile(), String.format("%1$05d", i) + ".jpg"));
            out.write(img);
            out.flush();
            out.close();
        }
    }

That gave me all the images, but the images were in the wrong order. My next attempt looked like this:

for (int i = 0; i <= reader.getNumberOfPages(); i++) {
  PdfDictionary d = reader.getPageN(i);
  PdfIndirectReference ir = d.getAsIndirectObject(PdfName.CONTENTS);
  PdfObject o = reader.getPdfObject(ir.getNumber());
  PdfStream stream = (PdfStream) o;
  // rest from example above
}

Although o.isStream() == true, I only get /Length and /Filter and the stream is only about 100 bytes long. No image to be found at all.

My question would be what the correct way would be to get all the images from a PDF file in the correct order.

回答1:

I found an answer elsewhere, namely the iText mailing list.

The following code works for me - please note that I switched to PdfBox:

PDDocument document = null; 
document = PDDocument.load(inFile); 
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator(); 
while (iter.hasNext()) {
            PDPage page = (PDPage) iter.next();
            PDResources resources = page.getResources();
            Map pageImages = resources.getImages();
            if (pageImages != null) { 
                Iterator imageIter = pageImages.keySet().iterator();
                while (imageIter.hasNext()) {
                    String key = (String) imageIter.next();
                    PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
                    image.write2OutputStream(/* some output stream */);
                }
            }
}


标签: java pdf itext