How to determine whether a PDF page contains text or is purely an image, using Java?
I have searched many forums and websites, but I have not found an answer yet.
Is it possible to use text extraction to find out whether a PDF page consists of text or is just an image?
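import java.io.FileOutputStream;
import java.io.PrintWriter;
import com.itextpdf.text.pdf.PdfReader; // iText 5 package names
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
PdfReader reader = new PdfReader(INPUTFILE);
PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Here I want to test whether page i is text or an image, if that is possible
    out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
out.close();
reader.close();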
There is no watertight way to do what you want.
Text can appear in different ways inside a PDF file. For instance, one can draw all the glyphs as vector shapes using path-drawing (graphics) operators instead of text operators. (I'm sorry if this sounds like gibberish to you, but I can assure you it's proper PDF terminology.)
If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.
In your code, you loop over all the pages and ask iText to extract the text from each page. If the result is not empty, that's already a good indication that the page contains text.
Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. MyTextRenderListener is an example of a custom implementation; it is used in the ParsingHelloWorld example.
There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you can be sure that there's an image on the page, and you can use the ImageRenderInfo object to obtain the position, width and height of the image (that is, if you know how to interpret the Matrix returned by the getImageCTM() method).
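To make the remark about interpreting the matrix more concrete, here is a minimal sketch in plain Java. It relies on one fact from the PDF specification: an image is painted into the unit square (0,0)-(1,1), and the current transformation matrix [a b c d e f] maps that square onto the page. The ImagePlacement class and its method names are hypothetical and not part of iText. As far as I know, with iText 5 the six values can be read from the Matrix via get(Matrix.I11), get(Matrix.I12), get(Matrix.I21), get(Matrix.I22), get(Matrix.I31) and get(Matrix.I32), but treat that mapping as an assumption to verify against the API documentation.

```java
// Hypothetical helper (not part of iText): interprets the six numbers
// [a b c d e f] of the transformation matrix under which an image is drawn.
final class ImagePlacement {
    final double x;      // where the image's origin corner (0,0) lands on the page
    final double y;
    final double width;  // length of the transformed bottom edge, in user units
    final double height; // length of the transformed left edge, in user units
    final double area;   // area of the transformed unit square

    private ImagePlacement(double x, double y, double width, double height, double area) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.area = area;
    }

    // The matrix maps (0,0) -> (e,f), (1,0) -> (a+e, b+f), (0,1) -> (c+e, d+f).
    static ImagePlacement fromCtm(double a, double b, double c, double d, double e, double f) {
        double width = Math.hypot(a, b);
        double height = Math.hypot(c, d);
        // The absolute value of the determinant is the parallelogram's area,
        // which stays correct when the image is rotated or skewed.
        double area = Math.abs(a * d - b * c);
        return new ImagePlacement(e, f, width, height, area);
    }

    // Rough fraction of the page covered by the image; ignores clipping
    // and any part of the image that falls outside the page.
    double coverage(double pageWidth, double pageHeight) {
        return area / (pageWidth * pageHeight);
    }
}
```

A coverage value close to 1 for a single image is typical of a scanned page, although that is only a heuristic: a page can just as well contain a full-page background image with real text on top.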
Using all these elements, you can go a long way toward achieving what you need, but be aware that there will always be exotic PDFs that escape all your checks.
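As a rough illustration of how the two signals could be combined into a per-page verdict, here is a small sketch in plain Java. The PageClassifier class, its Kind enum and the classify method are hypothetical names, not part of iText. The assumed inputs are the string returned by PdfTextExtractor.getTextFromPage() for the page and the number of times renderImage() was triggered while parsing that page.

```java
// Hypothetical heuristic (not part of iText) that combines the two signals
// discussed above: extracted text and the number of image callbacks.
final class PageClassifier {
    enum Kind { TEXT, IMAGE_ONLY, MIXED, UNKNOWN }

    static Kind classify(String extractedText, int imageCount) {
        // Whitespace-only output is treated the same as no text at all.
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        boolean hasImages = imageCount > 0;
        if (hasText && hasImages) {
            return Kind.MIXED;
        }
        if (hasText) {
            return Kind.TEXT;
        }
        if (hasImages) {
            return Kind.IMAGE_ONLY;
        }
        // Neither signal fired: a blank page, or one of the exotic cases,
        // for example glyphs drawn as vector paths.
        return Kind.UNKNOWN;
    }
}
```

Note that a scanned page with a hidden OCR text layer would be reported as MIXED by this sketch, so the result is a hint rather than a guarantee.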