When I extract an image with pdfbox, I get an incorrect dpi for the image for some PDFs. When I extract the image with Photoshop or Acrobat Reader Pro, Windows Photo Viewer shows its dpi as 200, but when I extract it with pdfbox the dpi is 72.
To extract the image I am using the code from this question: Not able to extract images from PDFA1-a format document
When I check the logs I see an unusual entry: 2015-01-23-main--DEBUG-org.apache.pdfbox.util.TIFFUtil:
<?xml version="1.0" encoding="UTF-8"?><javax_imageio_jpeg_image_1.0> <JPEGvariety> <app0JFIF majorVersion="1" minorVersion="2" resUnits="0" Xdensity="1" Ydensity="1" thumbWidth="0" thumbHeight="0"/> </JPEGvariety> <markerSequence> <dqt> <dqtable elementPrecision="0" qtableId="0"/> <dqtable elementPrecision="0" qtableId="1"/> </dqt> <dht> <dhtable class="0" htableId="0"/> <dhtable class="0" htableId="1"/> <dhtable class="1" htableId="0"/> <dhtable class="1" htableId="1"/> </dht> <sof process="0" samplePrecision="8" numLines="0" samplesPerLine="0" numFrameComponents="3"> <componentSpec componentId="1" HsamplingFactor="2" VsamplingFactor="2" QtableSelector="0"/> <componentSpec componentId="2" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/> <componentSpec componentId="3" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/> </sof> <sos numScanComponents="3" startSpectralSelection="0" endSpectralSelection="63" approxHigh="0" approxLow="0"> <scanComponentSpec componentSelector="1" dcHuffTable="0" acHuffTable="0"/> <scanComponentSpec componentSelector="2" dcHuffTable="1" acHuffTable="1"/> <scanComponentSpec componentSelector="3" dcHuffTable="1" acHuffTable="1"/> </sos> </markerSequence> </javax_imageio_jpeg_image_1.0>
I tried to google it, but I can't seem to find out what pdfbox means by this log entry. What does it mean?
You can download a sample pdf with this problem from this link: http://myslams.com/test/1.pdf
I have even tried itext, but it extracts the image with 96 dpi.
Am I doing something wrong? Or do pdfbox and itext have this limitation?
After some digging I found your 1.pdf. Thus,...
PDFBox
In comments to this recent answer @Tilman and you were discussing this older answer in which @Tilman pointed towards the PrintImageLocations PDFBox example. I ran it for your file and got:
On all pages this amounts to 200 dpi both in x and y directions (1704px / 8.52in = 888px / 4.44in = 2800px / 14.0in = 1464px / 7.32in = 200 dpi).
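The arithmetic above can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are hypothetical and not part of PDFBox or iText, and it assumes the displayed size is given in default user space units (72 per inch), which is how sizes are usually expressed on a PDF page.

```java
// Hypothetical helper, not part of PDFBox or iText.
final class DisplayedDpi {
    // PDF default user space has 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    // pixels: number of image samples along one axis.
    // displayedUnits: size the image occupies on the page along the same axis,
    // in user space units.
    static double dpi(int pixels, double displayedUnits) {
        if (displayedUnits <= 0) {
            throw new IllegalArgumentException("displayed size must be positive");
        }
        double inches = displayedUnits / UNITS_PER_INCH;
        return pixels / inches;
    }

    public static void main(String[] args) {
        // Pixel counts and sizes in inches quoted in this answer.
        int[] pixels = {1704, 888, 2800, 1464};
        double[] inches = {8.52, 4.44, 14.0, 7.32};
        for (int i = 0; i < pixels.length; i++) {
            double value = dpi(pixels[i], inches[i] * UNITS_PER_INCH);
            System.out.println(pixels[i] + " px over " + inches[i] + " in: "
                    + Math.round(value) + " dpi");
        }
    }
}
```

The point of the sketch is that the dpi is derived from the pixel count and the size at which the image is drawn on the page, not from anything stored in the image file itself.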
So PDFBox gives you the dpi values you are after.
(@Tilman: The current 2.0.0-SNAPSHOT version of that sample returns utter nonsense; you might want to fix this.)
iText
A simplified iText version of that PDFBox example would be this:
(Beware: I assumed unrotated and unskewed images.)
The results for your file:
Thus, also 200dpi all along. So iText, too, gives you the dpi values you are after.
Your code
The code you referenced had no chance of reporting a dpi value that is sensible in the context of the PDF, because it only extracts the images as found in the resources and ignores how the respective image resource is used on the page.
An image resource can be stretched, rotated, skewed, ... in any way the author likes when it is used in the page content.
BTW, a dpi value only makes sense if the author did not skew the image and rotated it only by a multiple of 90°.
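To make that condition concrete, here is a minimal sketch in plain Java. It assumes the image is drawn into the unit square transformed by a matrix with the components a, b, c, d (the translation part is irrelevant for the size), as is usual for image XObjects in PDF. The class and method names are hypothetical, and the matrix values in `main` are illustrative only, not read from your file.

```java
// Hypothetical helper, not part of PDFBox or iText.
final class CtmDpi {
    static final double EPS = 1e-9;

    // True if the matrix only scales (possibly mirrors) and rotates by a
    // multiple of 90 degrees, i.e. there is no skew and no odd rotation.
    static boolean isAxisAligned(double a, double b, double c, double d) {
        boolean upright = Math.abs(b) < EPS && Math.abs(c) < EPS;      // 0 or 180 degrees
        boolean quarterTurn = Math.abs(a) < EPS && Math.abs(d) < EPS;  // 90 or 270 degrees
        return upright || quarterTurn;
    }

    // Returns {dpi along the image width, dpi along the image height}.
    static double[] dpi(int widthPx, int heightPx,
                        double a, double b, double c, double d) {
        if (!isAxisAligned(a, b, c, d)) {
            throw new IllegalArgumentException("skewed or oddly rotated image: dpi is not meaningful");
        }
        // The image's unit x axis maps to (a, b), its unit y axis to (c, d).
        double widthUnits = Math.hypot(a, b);
        double heightUnits = Math.hypot(c, d);
        if (widthUnits < EPS || heightUnits < EPS) {
            throw new IllegalArgumentException("degenerate matrix: image has no area");
        }
        // 72 user space units per inch.
        return new double[] {widthPx * 72.0 / widthUnits, heightPx * 72.0 / heightUnits};
    }

    public static void main(String[] args) {
        // Illustrative: a 1704 x 2800 px image drawn 613.44 x 1008 units large.
        double[] upright = dpi(1704, 2800, 613.44, 0, 0, 1008);
        System.out.println(Math.round(upright[0]) + " x " + Math.round(upright[1]) + " dpi");
        // The same image rotated by 90 degrees yields the same resolution.
        double[] rotated = dpi(1704, 2800, 0, 613.44, -1008, 0);
        System.out.println(Math.round(rotated[0]) + " x " + Math.round(rotated[1]) + " dpi");
    }
}
```

For a skewed matrix the sketch refuses to report a value, which mirrors the remark above: in that case there is no single pixels-per-inch figure along the page axes.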