I am trying to find the position of some text on a PDF page.
What I have tried so far is to get the text of the PDF page with PdfTextExtractor using the simple text extraction strategy. I then loop over each word to check whether my word exists. I split the words using:
var Words = pdftextextractor.Split(new char[] { ' ', '\n' });
What I haven't been able to do is find the position of the text. All I need is the y coordinate of the word in the PDF file.
@Joris' answer explains how to implement a completely new extraction strategy / event listener for the task. Alternatively, one can try to tweak an existing text extraction strategy to do what you require.
This answer demonstrates how to tweak the existing LocationTextExtractionStrategy to return both the text and its characters' respective y coordinates.
Beware, this is but a proof of concept which in particular assumes text to be written horizontally, i.e. using an effective transformation matrix (ctm and text matrix combined) with b and c equal to 0. Furthermore, the character and coordinate retrieval methods of TextPlusY are not at all optimized and might take long to execute.
As the OP did not express a language preference, here is a solution for iText7 for Java:
TextPlusY
For the task at hand one needs to be able to retrieve characters and y coordinates side by side. To make this easier I use a class representing both the text and its characters' respective y coordinates. It is derived from CharSequence, a generalization of String, which allows it to be used in many String-related functions (TextPlusY.java).
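To illustrate the idea without iText on the classpath, here is a minimal sketch of such a class. It is not the TextPlusY.java listing referred to above: the name TextPlusYSketch, the methods append and yCoordAt, and the helper yCoordsOf are assumptions made for this example. It keeps runs of text next to the y coordinate they were drawn at and implements CharSequence, so java.util.regex can search it directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: a CharSequence that remembers a y coordinate for each of its characters. */
final class TextPlusYSketch implements CharSequence {
    // Parallel lists: every character of texts.get(i) has the y coordinate yCoords.get(i).
    private final List<String> texts = new ArrayList<>();
    private final List<Float> yCoords = new ArrayList<>();

    /** Appends a run of text whose characters all share the same y coordinate. */
    TextPlusYSketch append(CharSequence text, float y) {
        if (text.length() > 0) {
            texts.add(text.toString());
            yCoords.add(y);
        }
        return this;
    }

    /** Returns the y coordinate recorded for the character at the given index. */
    float yCoordAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int offset = index;
        for (int i = 0; i < texts.size(); i++) {
            if (offset < texts.get(i).length()) {
                return yCoords.get(i);
            }
            offset -= texts.get(i).length();
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    @Override
    public int length() {
        int length = 0;
        for (String text : texts) {
            length += text.length();
        }
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int offset = index;
        for (String text : texts) {
            if (offset < text.length()) {
                return text.charAt(offset);
            }
            offset -= text.length();
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length() || start > end) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        TextPlusYSketch result = new TextPlusYSketch();
        int position = 0; // index of the first character of the current run
        for (int i = 0; i < texts.size(); i++) {
            String text = texts.get(i);
            int from = Math.max(start - position, 0);
            int to = Math.min(end - position, text.length());
            if (from < to) {
                result.append(text.substring(from, to), yCoords.get(i));
            }
            position += text.length();
        }
        return result;
    }

    @Override
    public String toString() {
        return String.join("", texts);
    }

    /** Returns the y coordinate of the first character of every occurrence of word. */
    static List<Float> yCoordsOf(TextPlusYSketch text, String word) {
        List<Float> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(Pattern.quote(word)).matcher(text);
        while (matcher.find()) {
            result.add(text.yCoordAt(matcher.start()));
        }
        return result;
    }
}
```

Like the proof of concept described above, this sketch scans its runs linearly on every call to length, charAt and yCoordAt, so it is simple rather than fast.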
TextPlusYExtractionStrategy
Now we extend the LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need for that is to generalize the method getResultantText.
Unfortunately the LocationTextExtractionStrategy has hidden some methods and members (private or package protected) which need to be accessed here; thus, some reflection magic is required. If your framework does not allow this, you'll have to copy the whole strategy and manipulate it accordingly.
(TextPlusYExtractionStrategy.java)
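The reflection step on its own can be sketched independently of iText. In the example below, HiddenChunksStrategy is a hypothetical stand-in for a library class with a private member, and the field name chunks is an assumption for illustration only; the actual hidden members of LocationTextExtractionStrategy depend on the iText version and have to be looked up in its source.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a library class that hides its state. */
class HiddenChunksStrategy {
    private final List<String> chunks = new ArrayList<>();

    void accept(String chunk) {
        chunks.add(chunk);
    }
}

/** Subclass that reads the private list of its superclass via reflection. */
class ReflectiveStrategy extends HiddenChunksStrategy {
    private static final Field CHUNKS_FIELD;

    static {
        try {
            // Look the private field up once and lift the access check.
            Field field = HiddenChunksStrategy.class.getDeclaredField("chunks");
            field.setAccessible(true);
            CHUNKS_FIELD = field;
        } catch (NoSuchFieldException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Returns the list held in the superclass's private field. */
    @SuppressWarnings("unchecked")
    List<String> hiddenChunks() {
        try {
            return (List<String>) CHUNKS_FIELD.get(this);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("reflective access denied", e);
        }
    }
}
```

On Java 9 and later, setAccessible can fail with an InaccessibleObjectException when the target class lives in a module that does not open its package, which is one way a framework may "not allow this".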
Usage
Using these two classes you can extract text with coordinates and search therein like this:
(ExtractTextPlusY test method testExtractTextPlusYFromTest)
For my test document the output of the test code above is
My locale uses the comma as decimal separator; you might see 666.9 instead of 666,9.
The extra spaces you see can be removed by fine-tuning the base LocationTextExtractionStrategy functionality further. But that is the focus of other questions...
First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).
Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.
Possible implementation:
how to:
ITextExtractionStrategy has the following method in its interface:
Important to keep in mind is that rendering instructions in a PDF do not need to appear in order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:
render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"
You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).
Once that's done, it should be easy.
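The sorting and merging step can be sketched without iText as follows. Chunk is a hypothetical stand-in for whatever you keep from each TextRenderInfo (text, start and end x, baseline y), and the two tolerances are assumed tuning parameters. The sketch also assumes horizontal text and PDF coordinates in which y grows upwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {
    /** Hypothetical stand-in for the data kept from one text render event. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float y;

        Chunk(String text, float startX, float endX, float y) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.y = y;
        }
    }

    /** Groups chunks into lines (top to bottom) and merges each line left to right. */
    static List<Chunk> toLines(List<Chunk> chunks, float lineTolerance, float gapTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Larger y is higher on the page, so sort descending by y.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        List<Chunk> lines = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            // Start a new line with the topmost remaining chunk.
            float lineY = sorted.get(i).y;
            List<Chunk> line = new ArrayList<>();
            line.add(sorted.get(i++));
            // Collect every following chunk whose baseline is close enough to lineY.
            while (i < sorted.size() && lineY - sorted.get(i).y <= lineTolerance) {
                line.add(sorted.get(i++));
            }
            line.sort(Comparator.comparingDouble((Chunk c) -> c.startX));

            StringBuilder text = new StringBuilder();
            Chunk previous = null;
            for (Chunk chunk : line) {
                // A visible horizontal gap between two chunks becomes a space.
                if (previous != null && chunk.startX - previous.endX > gapTolerance) {
                    text.append(' ');
                }
                text.append(chunk.text);
                previous = chunk;
            }
            lines.add(new Chunk(text.toString(), line.get(0).startX, previous.endX, lineY));
        }
        return lines;
    }
}
```

Each merged Chunk carries the y coordinate of its line, so finding the y coordinate of a word comes down to searching the text of the merged chunks.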
I was able to manipulate it with my previous version for iText5. I don't know if you are looking for C#, but that is what the code below is written in.
I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.
You can implement this by adding a few lines to grab the data from your pdf.
Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.