Currently I'm extracting the text of PDFs with the iTextSharp library (in VB.NET).
I'd like to be independent of other tools and libraries, as I can't ship them to others along with my program.
Is there a solution (no .dll etc.) in any programming language to quickly extract the text of a PDF?
Short answer:
Of course there is a way of doing this. iText (alongside many other PDF libraries) is capable of doing it, so an algorithm for extracting text clearly exists.
Long answer:
PDF is not a WYSIWYG format.
A PDF document is sort of an ungodly marriage between "objects that reference each other" and "programming language".
Let me explain.
A PDF document has a graphics state. So whenever you see text in a PDF document (in a viewer like Adobe Reader), you are essentially seeing the result of some 'code' in the PDF document that says:
Go to position 50, 720
Set the active font to Helvetica, fontsize 12
Set the active drawing color to black
draw the glyph that corresponds to the character 'H'
Go to position 53, 720
draw the glyph that corresponds to the character 'e'
etc
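To make that concrete, here is a minimal Python sketch of such an interpreter. In real PDF syntax the steps above are operators in a content stream (`BT`/`ET` start and end a text object, `Tf` sets font and size, `Td` moves the position, `Tj` shows a string). The stream below is hand-written for illustration, and the function name `show_text_ops` is made up. The sketch assumes uncompressed data, literal strings without escapes or nesting, and no scaling or rotation, which real files rarely honour.

```python
import re

# Hand-written content stream fragment (an assumption for illustration;
# streams in real files are usually compressed and far messier).
CONTENT = b"""
BT
/F1 12 Tf
0 0 0 rg
50 720 Td
(Hello) Tj
0 -14 Td
(world) Tj
ET
"""

# Tokens: literal strings (no escapes, no nesting), names, numbers, operators.
TOKEN = re.compile(rb"\([^()\\]*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z*]+")


def show_text_ops(stream):
    """Return (x, y, font, size, text) for every Tj operator in the stream."""
    operands = []          # PDF is postfix: operands come before the operator
    x = y = 0.0            # start of the current text line
    font, size = None, None
    shown = []
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok.startswith(b"("):
            operands.append(tok[1:-1].decode("latin-1"))
        elif tok.startswith(b"/"):
            operands.append(tok[1:].decode("latin-1"))
        elif tok[:1] in b"+-.0123456789":
            operands.append(float(tok.decode("ascii")))
        else:
            if tok == b"BT":        # begin text object: position resets
                x = y = 0.0
            elif tok == b"Tf":      # set font and size
                font, size = operands[-2], operands[-1]
            elif tok == b"Td":      # move relative to the current line start
                x += operands[-2]
                y += operands[-1]
            elif tok == b"Tj":      # show a string at the current position
                shown.append((x, y, font, size, operands[-1]))
            # every other operator (rg, ET, ...) is ignored in this sketch
            operands.clear()
    return shown


for item in show_text_ops(CONTENT):
    print(item)
```

Even this toy version has to track state between instructions, and it ignores font encodings entirely: the bytes inside `( )` only happen to be readable here because the example uses plain ASCII.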
Instructions and resources (like fonts, images, vector graphics) can be grouped together in objects.
Each object is assigned a number, and is mentioned explicitly in the cross-reference table (at the end of the PDF document).
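As a rough illustration of what the cross-reference table holds, the Python sketch below assembles a tiny PDF-like byte string in memory and then reads the table back. Both function names (`build_tiny_pdf`, `read_xref`) are made up for this example. It assumes the classic plain-text table with a single section list; files that use cross-reference streams (PDF 1.5 and later) or incremental updates need considerably more work.

```python
def build_tiny_pdf():
    """Assemble a small, hypothetical PDF-like file with a classic xref table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"              # object 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)


def read_xref(pdf):
    """Return {object number: byte offset} for the in-use entries."""
    # The last "startxref" keyword is followed by the table's byte offset.
    marker = pdf.rfind(b"startxref")
    start = int(pdf[marker + len(b"startxref"):].split()[0])
    lines = pdf[start:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at the given offset")
    table = {}
    i = 1
    while not lines[i].startswith(b"trailer"):
        first, count = map(int, lines[i].split())   # subsection header
        for k in range(count):
            offset, _generation, kind = lines[i + 1 + k].split()
            if kind == b"n":                        # "n" = in use, "f" = free
                table[first + k] = int(offset)
        i += 1 + count
    return table


pdf = build_tiny_pdf()
for number, offset in sorted(read_xref(pdf).items()):
    # Each offset should point at the start of "N 0 obj".
    print(number, offset, pdf[offset:offset + 9])
```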
So, in order to read the text from a PDF document you would need to:
- read the XREF table
- figure out where (byte location) the /Page objects start
- parse each /Page object and all its sub objects (again using the XREF table to figure out where in the file each of these sub objects is)
- parse geometrical instructions (the graphics state does not need to flow in the same direction as the text)
- sort all visible characters (comparing background and foreground color, occlusion by other objects such as images, etc) according to the direction you expect the text to be written in
- build the return string
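The last two steps can be sketched in Python as follows. This assumes the hard part is already done: every visible glyph has been reduced to an `(x, y, width, char)` tuple. It also assumes horizontal, left-to-right text with y growing upward. The function name `build_text` and the two thresholds are arbitrary choices for this example, not values taken from any library.

```python
def build_text(glyphs, line_tolerance=2.0, space_gap=4.0):
    """Turn unordered (x, y, width, char) tuples into a string.

    Assumes horizontal left-to-right text and y growing upward.
    """
    # Group glyphs into lines: top to bottom, similar baseline = same line.
    lines = []
    for glyph in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    result = []
    for line in lines:
        line.sort(key=lambda g: g[0])            # left to right
        text = ""
        previous_end = None
        for x, _y, width, char in line:
            # PDF often stores no space characters at all, so a large
            # horizontal gap is the only hint that a word ended.
            if previous_end is not None and x - previous_end > space_gap:
                text += " "
            text += char
            previous_end = x + width
        result.append(text)
    return "\n".join(result)


# Glyphs in the order they might appear in the file, not in reading order.
glyphs = [
    (57.0, 720.0, 3.0, "i"),
    (50.0, 706.0, 6.0, "y"),
    (50.0, 720.0, 7.0, "H"),
    (56.0, 706.0, 6.0, "o"),
    (68.0, 720.0, 6.0, "a"),
    (62.0, 706.0, 6.0, "u"),
    (74.0, 720.0, 3.0, "l"),
    (77.0, 720.0, 3.0, "l"),
]
print(build_text(glyphs))
```

Rotated text, right-to-left scripts, multi-column layouts and occluded glyphs all break these assumptions, which is where most of the real effort goes.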
And that is probably why other people use libraries.
Don't get me wrong, I'm a huge fan of doing it yourself (it's the best way to gain a deep knowledge of how certain things work).
But look at it from the point of view of one of your users.
What would you trust more?
- A program that uses 'self written' code to handle PDF documents (total experience in parsing PDF documents < 1 year),
- or a program that simply calls a PDF library (total experience in parsing PDF documents > 20 years)?