I am using the following code to extract images from a PDF that is in PDF/A-1a format, but I am not able to get the images.
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
    PDResources pdResources = page.findResources();
    Map pageImages = pdResources.getImages();
    if (pageImages != null) {
        InputStream xmlInputStream = null;
        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
            System.out.println(convertStreamToString(xmlInputStream));
            System.out.println(pdxObjectImage.hashCode());
            System.out.println(pdxObjectImage.getColorSpace().getJavaColorSpace().isCS_sRGB());
            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
            totalImages++;
            break;
        }
    }
}
I am able to extract images from normal PDFs using the above code, but not from PDFs in PDF/A-1a format. It seems the following line
PDResources pdResources = page.findResources();
is not returning images. I have even tried page.getResources(), but I still do not get any images. I have also tried iText, and it does not give me any images either.
If I try to convert the page of the PDF to an image using the following code
BufferedImage bufferedImage = page.convertToImage();
File outputfile = new File(destinationDir+"image1.JPEG");
ImageIO.write(bufferedImage, "JPEG", outputfile);
the resulting images seem to have no metadata associated with them, so I still cannot determine their DPI or whether they are color or grey scale.
I am currently using PDFBox for this. I have already spent two days searching on Google, but I still haven't found any code or documentation for doing this.
How can I do this in Java?
Is it possible to get the DPI, or to tell whether the PDF is color or black and white, without extracting the images?
Your issue is a combination of two problems:
1) The "break;". Your file has two images. The first one is transparent or grey or whatever and JPEG encoded, but it isn't the one you want. The second one is the one you want, but the break exits the loop after the first image. So I just changed a code segment of yours to this:
while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
    System.out.println(totalImages);
    pdxObjectImage.write2file("C:\\SOMEPATH\\" + fileName + "_" + totalImages);
    totalImages++;
    //break;
}
2) Your second image (the interesting one) is JBIG2 encoded. To decode it, you need to add the levigo plugin to your class path, as mentioned here. If you don't, you'll get this message in 1.8.8, unless you disabled logging:
ERROR [main] org.apache.pdfbox.filter.JBIG2Filter:69 - Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
(You didn't get that error message because the break stopped the loop before it reached the second image, which is the one that is JBIG2 encoded.)
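To check up front whether a JBIG2 decoder is visible to ImageIO, you can ask ImageIO directly. This is only a minimal sketch using the JDK: the class and method names are made up for illustration, and it assumes the plugin registers itself under the format name "JBIG2".

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

class ImagePluginCheck {
    // Returns true if ImageIO can find at least one reader for the given format name.
    static boolean hasReaderFor(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Assumption: a JBIG2 plugin on the class path registers the name "JBIG2".
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
    }
}
```

If this reports false, the JBIG2 encoded image cannot be decoded until the plugin jar is added to the class path.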
Three bonus hints:
3) If you created this image yourself, e.g. on a photocopy machine, find out how to get PDF images without JBIG2 compression. JBIG2 is somewhat risky: lossy JBIG2 encoders have been known to substitute similar-looking characters.
4) Don't use pdResources.getImages(); the getImages call is deprecated. Instead, use getXObjects(), and then check the type of what you get when iterating:
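// getXObjects() returns all XObjects (images and forms), not only images
Map pageImages = pdResources.getXObjects();
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    Object o = pageImages.get(key);
    // skip anything that is not an image, e.g. form XObjects
    if (o instanceof PDXObjectImage)
    {
        PDXObjectImage pdxObjectImage = (PDXObjectImage) o;
        // do stuff
    }
}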
5) Use a foreach loop instead of the explicit Iterator.
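As a rough illustration of hints 4 and 5 together, here is the same type-checking iteration written as a foreach loop. To keep the sketch runnable without PDFBox, the XObject, ImageXObject and FormXObject classes are hypothetical stand-ins for PDFBox's XObject types; in real code you would iterate over the map returned by getXObjects() and test for PDXObjectImage.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ForEachXObjectsSketch {
    // Hypothetical stand-ins for PDFBox's XObject hierarchy (not real PDFBox classes).
    static class XObject {}
    static class ImageXObject extends XObject {}
    static class FormXObject extends XObject {}

    // Counts the entries that are images, skipping every other XObject type.
    static int countImages(Map<String, XObject> xObjects) {
        int count = 0;
        for (Map.Entry<String, XObject> entry : xObjects.entrySet()) {
            if (entry.getValue() instanceof ImageXObject) {
                count++; // "do stuff" with the image would go here
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, XObject> xObjects = new LinkedHashMap<String, XObject>();
        xObjects.put("Im0", new ImageXObject());
        xObjects.put("Fm0", new FormXObject());
        xObjects.put("Im1", new ImageXObject());
        System.out.println(countImages(xObjects));
    }
}
```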
And if it wasn't already obvious: this has nothing to do with PDF/A :-)
6) I forgot that you also asked how to see whether it is a b/w image. Here's some simple code (not optimized) that I mentioned in the comments:
BufferedImage bim = pdxObjectImage.getRGBImage();
boolean bwImage = true;
int w = bim.getWidth();
int h = bim.getHeight();
for (int y = 0; y < h; y++)
{
    for (int x = 0; x < w; x++)
    {
        Color c = new Color(bim.getRGB(x, y));
        int red = c.getRed();
        int green = c.getGreen();
        int blue = c.getBlue();
        if (red == 0 && green == 0 && blue == 0)
        {
            continue;
        }
        if (red == 255 && green == 255 && blue == 255)
        {
            continue;
        }
        bwImage = false;
        break;
    }
    if (!bwImage)
    {
        break;
    }
}
System.out.println(bwImage);
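Since you also asked about grey scale, the same pixel scan can be extended to tell the three cases apart. This is only a sketch under my own assumptions: the class name, the Kind enum and the classify method are made up for illustration, and a pixel is treated as grey when its red, green and blue values are exactly equal.

```java
import java.awt.image.BufferedImage;

class ImageColorCheck {
    enum Kind { BLACK_AND_WHITE, GREY_SCALE, COLOR }

    // Scans every pixel: any pixel with unequal channels means COLOR,
    // any grey value other than 0 or 255 rules out BLACK_AND_WHITE.
    static Kind classify(BufferedImage bim) {
        boolean onlyBlackOrWhite = true;
        for (int y = 0; y < bim.getHeight(); y++) {
            for (int x = 0; x < bim.getWidth(); x++) {
                int rgb = bim.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                if (red != green || green != blue) {
                    return Kind.COLOR;
                }
                if (red != 0 && red != 255) {
                    onlyBlackOrWhite = false;
                }
            }
        }
        return onlyBlackOrWhite ? Kind.BLACK_AND_WHITE : Kind.GREY_SCALE;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0x808080); // one mid-grey pixel, the rest stay black
        System.out.println(classify(img));
    }
}
```

With PDFBox you would pass in the BufferedImage obtained from pdxObjectImage.getRGBImage(), as in the code above. Note that JPEG encoded scans often contain near-grey pixels, so an exact equality test may be too strict for them.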