I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images it does not extract.
To extract the images I am using the following code:
Not able to extract images from PDFA1-a format document
You can download a sample PDF with this problem from this link:
http://myslams.com/test/2.pdf
Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?
As the OP has not yet replaced his stale sample PDF link with a working one, the question can only be answered in general terms.
The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.
Thus, the code may store too many images because image resources of a page may not necessarily be used on the page in question:
- On the one hand, an image resource may not be used at all in the file, or at least nowhere visible, being merely a left-over from some prior PDF editing session.
- On the other hand, multiple pages may share a resources dictionary containing all images on all these pages; in this case the OP's code exports many duplicates.
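The duplicate case can be illustrated with a small library-independent model. This is only a sketch: the `Image` class and the `distinctImages` method are hypothetical stand-ins, not PDFBox API, and the sketch assumes that pages sharing a resources dictionary reference the very same image objects. It does not address left-over images that are never drawn; detecting those would require inspecting the page content streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SharedResourceDedup {
    /** Hypothetical stand-in for an image resource; not a PDFBox class. */
    static final class Image {
        final String label;
        Image(String label) { this.label = label; }
    }

    /** Returns each distinct image object once, however many pages reference it. */
    static List<Image> distinctImages(List<Map<String, Image>> pageResources) {
        // Identity-based set: two entries count as equal only if they are the same object.
        Set<Image> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<Image> result = new ArrayList<>();
        for (Map<String, Image> resources : pageResources) {
            for (Image image : resources.values()) {
                if (seen.add(image)) {
                    result.add(image);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Image> shared = new LinkedHashMap<>();
        shared.put("Im1", new Image("logo"));
        shared.put("Im2", new Image("photo"));
        // Three pages that all point at the same resources dictionary.
        List<Map<String, Image>> pages = Arrays.asList(shared, shared, shared);
        // Exporting every resource of every page would write 6 files; only 2 are distinct.
        System.out.println(distinctImages(pages).size());
    }
}
```

With a real PDF library the same idea would apply to the underlying image objects rather than to this toy class.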
And the code may store too few images because there are other places where images may be put:
- Image data may be directly included in the page content stream, aka inline images.
- Constructs with their own resources (form xobjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
- Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
- XFA forms may provide their own images, too.
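For the nested-resources case, the general approach is to descend recursively into every construct that carries its own resources. The following is a minimal, library-independent sketch of that recursion for form xobjects only; `XObject`, `Image`, `Form`, and `allImages` are hypothetical names chosen for illustration, not PDFBox classes. Inline images, patterns, Type 3 fonts, annotation appearance streams, and XFA data are not covered here and would each need their own handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NestedResourceWalk {
    /** Hypothetical stand-in for any xobject resource; not a PDFBox type. */
    interface XObject {}

    static final class Image implements XObject {
        final String label;
        Image(String label) { this.label = label; }
    }

    /** A form xobject carries its own resources, which may hold further images or forms. */
    static final class Form implements XObject {
        final Map<String, XObject> resources = new LinkedHashMap<>();
    }

    /** Collects images from the given resources and, recursively, from nested forms. */
    static List<Image> allImages(Map<String, XObject> pageResources) {
        List<Image> out = new ArrayList<>();
        Set<XObject> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        collect(pageResources, visited, out);
        return out;
    }

    private static void collect(Map<String, XObject> resources,
                                Set<XObject> visited, List<Image> out) {
        for (XObject x : resources.values()) {
            if (!visited.add(x)) {
                continue; // already handled; also guards against reference cycles
            }
            if (x instanceof Image) {
                out.add((Image) x);
            } else if (x instanceof Form) {
                collect(((Form) x).resources, visited, out);
            }
        }
    }

    public static void main(String[] args) {
        Image direct = new Image("direct");
        Form inner = new Form();
        inner.resources.put("Im3", new Image("deep"));
        inner.resources.put("Im1", direct); // same object as on the page itself
        Form outer = new Form();
        outer.resources.put("Im2", new Image("nested"));
        outer.resources.put("Fm2", inner);

        Map<String, XObject> page = new LinkedHashMap<>();
        page.put("Im1", direct);
        page.put("Fm1", outer);

        // Looking only at the immediate page resources would find a single image.
        for (Image image : allImages(page)) {
            System.out.println(image.label);
        }
    }
}
```

The visited set serves two purposes in this sketch: it avoids exporting the same image twice and it stops the recursion if forms reference each other.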
As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.
EDIT
According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". In particular, the pointer to example code appropriate for PDFBox version 1.8.8, the version used by the OP, seems to have been important.
Thus, wrong output of any kind may also result from software orchestration issues, such as using example code that does not match the library version in use.