Determine whether a PDF page contains text or is p

How can I determine, using Java, whether a PDF page contains text or is purely an image?

I have searched many forums and websites, but I have not found an answer yet.

Is it possible to extract the text from a PDF in order to tell whether a page consists of a picture or of text?

PdfReader reader = new PdfReader(INPUTFILE);
PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Here I want to test the structure of the page, if that is possible
    out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
// Close the writer so the extracted text is flushed to disk, then release the reader
out.close();
reader.close();

Tags: java parsing itext pdfbox

1 Answer

在下西门庆

#2 · 2019-04-23 09:43

There is no watertight way to do what you want.

Text can appear in different ways inside a PDF file. For instance: one can draw all the glyphs as vector graphics, using path construction and painting operators instead of text operators. (I'm sorry if this sounds like Chinese to you, but I can assure you it's proper PDF language.)

If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.

In your code, you loop over all the pages, and you ask iText if there's any text on the page. That's already a good indication.

Internally, your code uses the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. For an example of a custom implementation, see MyTextRenderListener, which is used in the ParsingHelloWorld example.

There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you're 100% sure that there's an image on the page, and you can use the ImageRenderInfo object to obtain the position, the width and the height of the image (that is: if you know how to interpret the Matrix returned by the getImageCTM() method).
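As a rough sketch of how such a matrix can be interpreted: in PDF an image is painted into the unit square, and the current transformation matrix [a b c d e f] maps that square onto the page. The helper below is hypothetical (ImagePlacement and its methods are not part of iText) and works on the six raw numbers. With iText 5 they would presumably be read from the Matrix via get(Matrix.I11), get(Matrix.I12), get(Matrix.I21), get(Matrix.I22), get(Matrix.I31) and get(Matrix.I32), but treat that mapping as an assumption to verify against your iText version.

```java
/**
 * Hypothetical helper (not part of iText): derives the placement of an image
 * from the six entries [a b c d e f] of its current transformation matrix.
 */
final class ImagePlacement {
    final double x;      // where the image's own origin lands on the page
    final double y;
    final double width;  // length of the image's horizontal edge, in user-space units
    final double height; // length of the image's vertical edge, in user-space units

    private ImagePlacement(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    static ImagePlacement fromCtm(double a, double b, double c, double d,
                                  double e, double f) {
        // The unit vector (1,0) is mapped to (a,b) and (0,1) to (c,d),
        // so the edge lengths are the lengths of those two vectors.
        double width = Math.hypot(a, b);
        double height = Math.hypot(c, d);
        // The origin (0,0) is mapped to (e,f). For a rotated image this is
        // the image's own lower-left corner, not that of its bounding box.
        return new ImagePlacement(e, f, width, height);
    }

    /** Area covered on the page: the area of the transformed unit square. */
    static double coveredArea(double a, double b, double c, double d) {
        return Math.abs(a * d - b * c);
    }
}
```

Comparing coveredArea(...) with the area of the page is one possible way to spot a page that consists of a single full-page scan, though that threshold is a judgment call and not something iText decides for you.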

Using all these elements, you can already get a long way to achieving what you need, but be aware that there will always be exotic PDFs that will escape all your checks.
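To make the combination of these checks concrete, here is a minimal sketch of the bookkeeping, kept free of iText so that it stands on its own. PageContentTally, onText(), onImage() and Kind are hypothetical names. The idea is that a custom RenderListener would call onText(renderInfo.getText()) from renderText() and onImage() from renderImage(), using one tally per page. Whitespace is ignored because a page that only yields blanks is not a useful sign of text.

```java
/**
 * Hypothetical per-page tally (not part of iText). Feed it from the
 * renderText() and renderImage() callbacks of a custom RenderListener.
 */
final class PageContentTally {
    enum Kind { EMPTY, TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGE }

    private int textChars; // non-whitespace characters seen so far
    private int images;    // number of image callbacks seen so far

    /** Call once per text chunk reported by the parser. */
    void onText(String text) {
        if (text == null) {
            return;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                textChars++;
            }
        }
    }

    /** Call once per image reported by the parser. */
    void onImage() {
        images++;
    }

    /** A heuristic verdict only: glyphs drawn as vector paths are counted as neither. */
    Kind classify() {
        boolean hasText = textChars > 0;
        boolean hasImage = images > 0;
        if (hasText && hasImage) {
            return Kind.TEXT_AND_IMAGE;
        }
        if (hasText) {
            return Kind.TEXT_ONLY;
        }
        if (hasImage) {
            return Kind.IMAGE_ONLY;
        }
        return Kind.EMPTY;
    }
}
```

A page that comes back as IMAGE_ONLY is a likely scan, and an EMPTY page may still show text drawn as vector graphics, which is exactly the kind of exotic case that escapes these checks.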
