Check if a PDF file is a scanned one

Posted 2019-01-19 13:56

Question:

What is the best way to programmatically check whether a PDF file consists entirely of scanned pages? I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and use the result to decide whether the file has been OCRed, but this approach is not 100% accurate. I'd like to know whether there is another way to tackle the problem.

As you can gather, the solution must be Java-based.

Answer 1:

Your best bet might be to check whether it has text and also whether it contains a large page-sized image or lots of tiled images that cover the page. If you also check the metadata, this should cover most cases.
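As a rough illustration of the text-plus-image check, here is a minimal Java sketch. It assumes you have already extracted, per page, whether any text is present and the bounding boxes of the images drawn on the page; how you get those from PDFBox or iText is left out. The class name ScanHeuristic, the Box type, and the 0.9 coverage threshold are all assumptions made for this example, not part of either library.

```java
import java.util.List;

/** Sketch of a per-page "looks scanned" test; all names here are illustrative. */
final class ScanHeuristic {

    /** Axis-aligned rectangle in page units, origin at the lower-left corner. */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    enum Kind { SCANNED_NO_TEXT, SCANNED_WITH_TEXT_LAYER, NOT_SCANNED }

    /** Assumed threshold: images must cover at least 90% of the page. */
    static final double MIN_COVERAGE = 0.9;

    /**
     * Fraction of the page covered by the given image boxes, in [0, 1].
     * Assumes the images do not overlap each other (true for simple tiling),
     * so the clipped areas can simply be summed.
     */
    static double imageCoverage(double pageWidth, double pageHeight, List<Box> images) {
        if (pageWidth <= 0 || pageHeight <= 0) {
            throw new IllegalArgumentException("page size must be positive");
        }
        double covered = 0.0;
        for (Box b : images) {
            // Clip each image to the page before measuring it.
            double left = Math.max(0.0, b.x);
            double bottom = Math.max(0.0, b.y);
            double right = Math.min(pageWidth, b.x + b.width);
            double top = Math.min(pageHeight, b.y + b.height);
            if (right > left && top > bottom) {
                covered += (right - left) * (top - bottom);
            }
        }
        return Math.min(1.0, covered / (pageWidth * pageHeight));
    }

    /** Combines the two signals: image coverage decides "scanned", text decides "OCRed". */
    static Kind classify(boolean hasText, double coverage) {
        if (coverage < MIN_COVERAGE) {
            return Kind.NOT_SCANNED;
        }
        return hasText ? Kind.SCANNED_WITH_TEXT_LAYER : Kind.SCANNED_NO_TEXT;
    }
}
```

A page with a full-page image and some text is reported as a scan with an OCR text layer rather than as a born-digital page, which is the case a text-only check gets wrong. A document would count as totally scanned only if every page is classified as one of the two scanned kinds.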



Answer 2:

IMHO you cannot decide that for sure. But you can try a few things: look for text, OCR the PDF and decide based on the amount of recognized text, or look for typical scanning artifacts such as fade-outs or paper/book margins.



Answer 3:

You can check whether a PDF has any font resources (a pretty good indication of whether the document contains any text) using the HasFontResources function in Quick PDF Library Lite, a free ActiveX component that you could theoretically use from Java with the help of a third-party add-on.

Checking for text/font resources is the most accurate method for determining whether a PDF may have been generated by a scanning process. Combine that with Mark Stephens' suggestion of looking for a large page-sized image, and so on.

Unfortunately, no method is guaranteed to be 100% accurate for checking whether a PDF was scanned.



Answer 4:

Do you know how the document would have been scanned, if it was? While the "Creator" metadata item is not mandatory, it could be a useful clue if your scanner sets it.



Answer 5:

I simply judge that by size. Scanned documents are unreasonably large. For documents of up to 1000 pages, my rule of thumb is: a true text PDF is 1-20 MB, while a scanned one can be 30 to 100 MB.
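This rule of thumb could be written down in Java roughly as follows. The class and enum names are made up for the example, and the 20 MB and 30 MB cut-offs are taken directly from the figures above, so the sketch is only meaningful for long documents and leaves the gap between the two ranges undecided.

```java
/** Sketch of the size-based rule of thumb; names and cut-offs are illustrative. */
final class SizeHeuristic {

    enum Guess { LIKELY_TEXT, UNCERTAIN, LIKELY_SCANNED }

    static final long MB = 1024L * 1024L;

    /** Upper end of the "true text PDF" range quoted above. */
    static final long TEXT_MAX_BYTES = 20 * MB;

    /** Lower end of the "scanned PDF" range quoted above. */
    static final long SCANNED_MIN_BYTES = 30 * MB;

    /** Guesses from file size alone; sizes between the two ranges stay undecided. */
    static Guess guessBySize(long fileSizeBytes) {
        if (fileSizeBytes < 0) {
            throw new IllegalArgumentException("file size must not be negative");
        }
        if (fileSizeBytes <= TEXT_MAX_BYTES) {
            return Guess.LIKELY_TEXT;
        }
        if (fileSizeBytes >= SCANNED_MIN_BYTES) {
            return Guess.LIKELY_SCANNED;
        }
        return Guess.UNCERTAIN;
    }
}
```

The size can be obtained with java.nio.file.Files.size(path). Note that a short scanned document falls well under 20 MB and would be guessed as text, so this works best as a quick pre-filter rather than a final answer.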



Answer 6:

find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'file="$1"; if [ "$(pdffonts "$file" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$file"; fi' _ {}

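Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains fonts, because the first two lines are just the table header. The command prints the names of all PDF files that list no fonts, which are therefore likely to be scanned PDFs without a text layer.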
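Since the question asks for Java, the same check could be driven from Java along these lines. This is only a sketch: it assumes the pdffonts command-line tool is installed and on the PATH and that the JVM is Java 11 or later. The class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Sketch: decide from pdffonts output whether a PDF lists any fonts. */
final class PdfFontsCheck {

    /** pdffonts prints a two-line table header before any font rows. */
    static final int HEADER_LINES = 2;

    /** Returns true if the captured pdffonts output contains at least one font row. */
    static boolean listsFonts(String pdffontsOutput) {
        int lines = 0;
        for (String line : pdffontsOutput.split("\\R")) {
            if (!line.trim().isEmpty()) {
                lines++;
            }
        }
        return lines > HEADER_LINES;
    }

    /** Runs pdffonts on the file and returns its standard output; stderr is discarded. */
    static String runPdfFonts(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return output;
    }

    /** True if pdffonts reports no fonts, mirroring the shell one-liner above. */
    static boolean looksScanned(Path pdf) throws IOException, InterruptedException {
        return !listsFonts(runPdfFonts(pdf));
    }
}
```

As with the shell version, a file that pdffonts cannot read produces no output and is therefore reported as having no fonts, so you may want to check the process exit code separately before trusting the result.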



Tags: java pdf ocr