I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}" and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I wanted: the output shows the coordinates of each character instead.
Here is the code:
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text : textPositions) {
        System.out.println("String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}
I am using PDFBox 2.0.
The last method in which PDFBox's PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method writeString(String, List<TextPosition>). One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if one requested sorting to start with).
(Actually I would have preferred to intercept in the calling method writeLine, which according to the names of its parameters and local variables has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe this changes before the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)
Furthermore, it is helpful to use a helper class that wraps sequences of TextPosition instances in a String-like class to make the code clearer.
With this in mind one can search for the variables like this
with this helper class
To merely output their positions, widths, final letters, and final letter positions, you can then use this
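As a rough, library-free illustration of the idea (not the answer's own listings): the sketch below uses a hypothetical Glyph class in place of TextPosition, holding the values that getXDirAdj(), getYDirAdj(), and getWidthDirAdj() would supply. The names WordSearchSketch, Glyph, Hit, layout, and find are all assumptions made up for this example. It joins the characters into a string, looks for the search term, and merges the extents of the matching glyphs into one position per hit.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free sketch; Glyph stands in for the values a PDFBox TextPosition exposes. */
public class WordSearchSketch {

    /** One character with its position (hypothetical stand-in for TextPosition). */
    public static final class Glyph {
        final char c;       // simplification: real getUnicode() values may hold several chars
        final float x;      // left edge, as getXDirAdj() would report it
        final float y;      // y value, as getYDirAdj() would report it
        final float width;  // as getWidthDirAdj() would report it

        Glyph(char c, float x, float y, float width) {
            this.c = c;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Position of one match: x of its first glyph, y of its first glyph, total width. */
    public static final class Hit {
        final String text;
        final float x;
        final float y;
        final float width;

        Hit(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Test helper: lays out a string on one line with a fixed glyph width. */
    public static List<Glyph> layout(String line, float startX, float y, float glyphWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            glyphs.add(new Glyph(line.charAt(i), startX + glyphWidth * i, y, glyphWidth));
        }
        return glyphs;
    }

    /** Finds every occurrence of term in the glyph run and merges the glyph extents. */
    public static List<Hit> find(List<Glyph> glyphs, String term) {
        List<Hit> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        StringBuilder joined = new StringBuilder();
        for (Glyph g : glyphs) {
            joined.append(g.c);
        }
        String text = joined.toString();
        int from = 0;
        int index;
        while ((index = text.indexOf(term, from)) >= 0) {
            Glyph first = glyphs.get(index);
            Glyph last = glyphs.get(index + term.length() - 1);
            // Width spans from the left edge of the first glyph to the right edge of the last.
            hits.add(new Hit(term, first.x, first.y, last.x + last.width - first.x));
            from = index + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = layout("a ${abc} b", 10f, 100f, 5f);
        for (Hit h : find(glyphs, "${abc}")) {
            System.out.println(h.text + " at x=" + h.x + ", y=" + h.y + ", width=" + h.width);
        }
    }
}
```

In PDFBox itself, the same loop would presumably sit inside the overridden writeString, with the string built from each TextPosition's getUnicode() value.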
For tests I created a small test file using MS Word:
The output of this test
is
I was a bit surprised that ${var 2} has been found when it is on a single line; after all, the PDFBox code made me assume that the method writeString I overrode only retrieves words; it looks as if it retrieves longer parts of the line than mere words...
If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.

I was looking to highlight different words in a PDF file. To do this, I need to know the word coordinates precisely, so I take the (x, y) coordinate of the top-left corner of the first letter and the (x, y) coordinate of the top-right corner of the last letter.
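The corner idea can be sketched without any PDF library. The example below is only an illustration under stated assumptions: the y value is taken to be a baseline measured downward from the top edge of an unrotated page (my reading of how getYDirAdj() behaves), the page origin is assumed to sit at its bottom-left corner, and HighlightQuadSketch, wordBox, and all parameter names are hypothetical. It turns the first and last letter positions into a rectangle expressed in bottom-left-origin coordinates.

```java
import java.util.Arrays;

/** Library-free sketch of building a highlight rectangle for one word. */
public class HighlightQuadSketch {

    /**
     * Returns {left, bottom, right, top} in a coordinate system whose origin is the
     * bottom-left corner of the page.
     *
     * Assumptions (not confirmed by the thread): baselineYFromTop is measured downward
     * from the top edge of the page, height is the glyph height above the baseline,
     * and the page is not rotated.
     */
    public static float[] wordBox(float firstX, float lastX, float lastWidth,
                                  float baselineYFromTop, float height, float pageHeight) {
        float left = firstX;                          // left edge of the first letter
        float right = lastX + lastWidth;              // right edge of the last letter
        float bottom = pageHeight - baselineYFromTop; // flip y so it counts up from the bottom
        float top = bottom + height;                  // top edge of the letters
        return new float[] { left, bottom, right, top };
    }

    public static void main(String[] args) {
        // A word starting at x=20 and ending with a 5-unit-wide letter at x=45,
        // baseline 100 units below the top of a 792-unit-high page, letters 8 units tall.
        float[] box = wordBox(20f, 45f, 5f, 100f, 8f, 792f);
        System.out.println(Arrays.toString(box));
    }
}
```

If the page is rotated or its visible area does not start at the origin, the conversion would need further adjustment, which this sketch deliberately leaves out.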
Later on, save the points in an array. Keep in mind that to get the y coordinate right you need the position relative to the page size, because of how the coordinate is given. The value from getYDirAdj() is absolute and often does not match the y position on the page.

As mentioned, this is not an answer to your question, but below is a skeleton example of how you would do this in iText. This is not to say the same is not possible in PDFBox.
Basically you make a RenderListener that accepts the "parse events" as they happen. You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all the information you need to reconstruct the layout, including x/y coordinates and the text/image/... that make up the content.