How is hidden text stored in an OCR-enhanced PDF file?

Posted 2019-06-08 15:14

Question:

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata

I'm currently looking for some details about PDF files. It's most important to me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem to be possible with Adobe Acrobat...).

To that end, I've been looking at different ways to OCR my PDF files. I found three candidates which seem to do what they should (more or less). All three have their pros and cons, and each seems to take a different approach to storing the data in the PDF file. Let me explain:

  • a file OCRed with Adobe Acrobat:

    https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf

    results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text which is stored hidden.

  • a file OCRed with ABBYY FineReader:

    https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf

    does not seem to work with the default Adobe preflight script, as it does not display any additional layers.

    But as far as I was able to work out, these files seem to have a background text layer containing the OCRed text, which lies underneath the image that is ultimately shown to the user. Unfortunately, this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...

  • a file OCRed with Tesseract 4 (alpha):

    https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf

    is also doing some weird magic with the hidden text part.

But in all three cases I'm able to search for words in the files and see the text by using "Remove hidden information" and selecting "hidden text".

I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?

S.

P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/

Answer 1:

Does anyone know how these programs are storing their hidden text information really?

You correctly found that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:

  • ABBYY creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
  • Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which neither fills nor strokes the glyphs and so draws nothing).
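To make the two orderings concrete, here is a minimal Python sketch that classifies a decoded page content stream by looking at the order of the image-painting operator (Do), the text rendering mode operator (Tr) and the text-showing operators (Tj, TJ, ', "). The function name classify_hidden_text and both sample streams are made up for illustration; they are hand-written and not extracted from the sample files in the question. The tokenizer is deliberately naive (whitespace splitting, no q/Q graphics-state tracking, and Do may also paint a form XObject rather than an image), so treat it as an aid to reading content streams rather than a real parser.

```python
def classify_hidden_text(content):
    """Classify a decoded page content stream by how its OCR text is hidden.

    Naive sketch: splits on whitespace, ignores q/Q graphics-state nesting,
    and can be fooled by operator-like words inside string literals.
    """
    show_text_ops = {"Tj", "TJ", "'", '"'}
    tokens = content.split()
    render_mode = 0          # PDF default text rendering mode: fill
    image_painted = False    # has a Do operator been seen so far?
    visible_text_before_image = False

    for i, tok in enumerate(tokens):
        if tok == "Tr" and i > 0 and tokens[i - 1].isdigit():
            # "<n> Tr" sets the text rendering mode; 3 means invisible
            render_mode = int(tokens[i - 1])
        elif tok == "Do":
            image_painted = True
        elif tok in show_text_ops:
            if render_mode == 3:
                return "invisible text (rendering mode 3)"
            if not image_painted:
                visible_text_before_image = True

    if visible_text_before_image and image_painted:
        return "visible text covered by image"
    return "no hidden text detected"


# Hand-written sample streams (assumptions, not taken from the real files).
image_then_invisible_text = (
    "q 595 0 0 842 0 0 cm /Im0 Do Q "
    "BT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET"
)
text_then_image = (
    "BT /F1 10 Tf 72 700 Td (Hello) Tj ET "
    "q 595 0 0 842 0 0 cm /Im0 Do Q"
)

print(classify_hidden_text(image_then_invisible_text))
print(classify_hidden_text(text_then_image))
```

The first sample mirrors the Acrobat/Tesseract ordering (image, then text in mode 3), the second the ABBYY ordering (visible text, then the image on top).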

The difference between the latter two results is the choice of font:

  • Acrobat uses regular standard 14 fonts, for which a PDF viewer has a font program to render them as normal glyphs.
  • Tesseract uses a font named GlyphLessFont and embeds a font program for it in the result file. When rendered, the glyphs of this font do not show as normal Latin glyphs but merely as empty space.
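One rough way to check which fonts a given file declares is to scan its raw bytes for /BaseFont entries and compare the names against the standard 14 font names. The Python sketch below does only that; the function name base_fonts and the sample bytes are hypothetical. It assumes the font dictionaries are stored uncompressed, so it will miss fonts that sit inside compressed object streams, where a real PDF library would be needed.

```python
import re

# The 14 standard font names defined by the PDF specification.
STANDARD_14 = {
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
}


def base_fonts(pdf_bytes):
    """Map each /BaseFont name found in the raw bytes to True if it is a
    standard 14 font, False otherwise.

    Assumption: font dictionaries are not hidden in compressed object streams.
    """
    names = re.findall(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", pdf_bytes)
    return {n.decode("latin-1"): n.decode("latin-1") in STANDARD_14 for n in names}


# Hand-written fragment (an assumption, not copied from the sample files).
sample = (
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> "
    b"<< /Type /Font /Subtype /Type0 /BaseFont /GlyphLessFont >>"
)
print(base_fonts(sample))
```

To try it on one of the sample files, read the file in binary mode and pass the bytes to base_fonts.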

Considering the visual effect you observed with the ABBYY result, the approach used by Acrobat and Tesseract might be preferable.

Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste. They are only used in the invisible rendering mode anyway.



Tags: pdf ocr