I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.
I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?
I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.
Open the document in Acrobat. Go to File -> Properties. Look in the "Advanced" section and find the PDF Producer. If it reads something like "Paper Capture..." then it has been OCR'd.
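For a batch job, the same Producer check could be scripted rather than done by hand. Here is a rough Python sketch, assuming the Xpdf pdfinfo tool is installed and on the PATH and that it prints a "Producer:" line. The helper names are made up for illustration, and matching on "paper capture" is an assumption based on the string mentioned above; other OCR software will write a different Producer.

```python
import subprocess

def producer_from_pdfinfo(output: str) -> str:
    """Return the value of the 'Producer:' line in pdfinfo output, or ''."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "producer":
            return value.strip()
    return ""

def looks_ocred_by_acrobat(producer: str) -> bool:
    # Assumption: Acrobat's OCR leaves "Paper Capture" in the Producer field.
    return "paper capture" in producer.lower()

def pdf_producer(path: str) -> str:
    # Runs the external pdfinfo tool; raises if it is missing or fails.
    result = subprocess.run(
        ["pdfinfo", path],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return producer_from_pdfinfo(result.stdout)
```

Calling looks_ocred_by_acrobat(pdf_producer(path)) for each file would give a first rough split, but a missing or different Producer does not prove the file has no text.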
Hope this helps.
Use "dtsearch" to create an index of all the PDF files, then "view the log file" of the indexing process to check the list of PDF files that were not indexed.
Scanned images that were converted to PDF and OCR'ed afterwards to make the text searchable normally contain the text parts rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the hits that get highlighted are on the invisible text.
I'd recommend you look at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html
Example usage of pdffonts:
This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).
This PDF uses 2 fonts (indicated by the 'name' column). The font 'Universe-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.
This PDF does not use a single font, and hence does not have any text embedded (so no OCR either).
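The "no fonts means no text" test could be automated along these lines. This is only a sketch in Python, assuming pdffonts is on the PATH and prints a header row, a dashed separator row, and then one line per font; the function names are made up for illustration.

```python
import subprocess

def font_rows(pdffonts_output: str) -> list:
    """Return the font lines that follow the dashed separator row."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("---"):
            # Everything after the separator is one font per line.
            return [row for row in lines[i + 1:] if row.strip()]
    return []

def has_fonts(path: str) -> bool:
    # Runs the external pdffonts tool; raises if it is missing or fails.
    result = subprocess.run(
        ["pdffonts", path],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return bool(font_rows(result.stdout))
```

A file where has_fonts(path) is False would be a candidate for OCR. Files that do list fonts may still have pages that are pure images, so this is a per-file hint rather than a per-page answer.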
Example usage of pdftotext:
This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you'd know there was no OCR...
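In a script, the pdftotext check might look roughly like this. It is a Python sketch that assumes pdftotext is on the PATH and relies on "-" as the output file name to send the text to stdout; the function names are illustrative, and "at least one letter or digit" is just one possible definition of "has text".

```python
import subprocess

def contains_text(extracted: str) -> bool:
    """True if the extracted output has at least one letter or digit."""
    # Whitespace and page-break form feeds alone do not count as text.
    return any(ch.isalnum() for ch in extracted)

def pdf_has_text(path: str) -> bool:
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return contains_text(result.stdout)
```

Files where pdf_has_text(path) returns False could then be queued for OCR and the rest skipped.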
Various PDF tools can tell you if there's text. Some are available as COM controls, and maybe even native .NET ones.
Sorry to dig up an old thread, but if you found this, have a look at my thread:
Batch OCR Program for PDFs
You can get extra information about the PDF by catting it on Unix/Linux/OS X or by opening it in "rb" mode in Python. (Of course that's Python and you didn't want to use that, but maybe your language has something equivalent.)
A very low-tech solution: any file that contains OCR'd text will undoubtedly contain the letter "a", so search the contents of all the files for the ones that don't contain the letter "a", i.e. "NOT a". Any file that shows up won't have been OCR'd.
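A rough idea of what reading the raw bytes in Python could look like. The markers searched for are assumptions: "/Font" is the PDF name used in font dictionaries, and "Paper Capture" is the Producer string Acrobat's OCR is said to leave behind. Newer PDFs can store objects in compressed streams or encode metadata differently, so a miss here proves nothing; treat the result as a hint only. The function names are made up.

```python
def raw_pdf_hints(data: bytes) -> dict:
    """Look for a few plain-text markers in the raw bytes of a PDF."""
    return {
        "is_pdf": data.startswith(b"%PDF-"),
        # Also matches /FontDescriptor, which still implies a font is present.
        "mentions_font": b"/Font" in data,
        "mentions_paper_capture": b"Paper Capture" in data,
    }

def inspect_file(path: str) -> dict:
    # Binary mode avoids any text decoding of the PDF contents.
    with open(path, "rb") as f:
        return raw_pdf_hints(f.read())
```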