I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.
My aim is to extract the text that has been highlighted. For that I use `PdfTextExtractor`:
    // pdfArray holds the annotation rectangle as [llx, lly, urx, ury]
    Rectangle rect = new Rectangle(
        pdfArray.GetAsNumber(0).FloatValue,
        pdfArray.GetAsNumber(1).FloatValue,
        pdfArray.GetAsNumber(2).FloatValue,
        pdfArray.GetAsNumber(3).FloatValue);
    // Only pass on text that the region filter accepts for this rectangle
    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
    return textInsideRect;
The result returned by `PdfTextExtractor` is not entirely correct. For instance, it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.
Interestingly enough, the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
I would love to hear any input regarding this issue, including solutions that don't involve iTextSharp.
Highlight annotations are represented as a collection of quadrilaterals, stored in the `/QuadPoints` entry of the annotation dictionary, that describe the area(s) on the page covered by the annotation.
Why are they this way?
This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code which initially only used a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.
As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.
In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.
Almost all text-related code was refactored to use this code.
Then we get highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures and then the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy gets placed into a new text highlight annotation.
How do you get the words on the page that are highlighted? Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all quads that are contained within the set from the annotation.
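To make that matching step a bit more concrete, here is a minimal sketch in plain Java (no PDF library involved). It assumes you already have the flat `/QuadPoints` numbers (8 per quadrilateral) and one bounding box per extracted character. The class and method names are made up for illustration, and reducing each quadrilateral to its axis-aligned bounding box is a simplification that only holds for unrotated text.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any PDF library.
class QuadPointsFilter {

    /** Turns a flat /QuadPoints array (8 numbers per quad) into bounding boxes. */
    static List<Rectangle2D> toBoxes(double[] quadPoints) {
        List<Rectangle2D> boxes = new ArrayList<>();
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
            double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
            // min/max makes the result independent of the vertex order
            for (int j = 0; j < 8; j += 2) {
                minX = Math.min(minX, quadPoints[i + j]);
                maxX = Math.max(maxX, quadPoints[i + j]);
                minY = Math.min(minY, quadPoints[i + j + 1]);
                maxY = Math.max(maxY, quadPoints[i + j + 1]);
            }
            boxes.add(new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }

    /** A character counts as highlighted if the centre of its box lies in any quad box. */
    static boolean isHighlighted(Rectangle2D charBox, List<Rectangle2D> quadBoxes) {
        for (Rectangle2D box : quadBoxes) {
            if (box.contains(charBox.getCenterX(), charBox.getCenterY())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] quadPoints = {100, 710, 150, 710, 100, 700, 150, 700};
        List<Rectangle2D> boxes = toBoxes(quadPoints);
        System.out.println(isHighlighted(new Rectangle2D.Double(110, 700, 5, 10), boxes)); // true
        System.out.println(isHighlighted(new Rectangle2D.Double(90, 700, 5, 10), boxes));  // false
    }
}
```

Testing the centre of the character box rather than the whole box is a deliberate choice in this sketch: glyph boxes often stick out of the highlight quad by a fraction of a point, and a strict containment test would then drop characters the user clearly selected.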
The cause
The fact that the whole sentence is written by a single TJ instruction actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces as a whole. Thus, that whole sentence is accepted by the filter.
What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
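The effect of such a pre-processing step can be illustrated without iText at all. The following is only a model: `Chunk`, `Listener`, `RegionCollector` and `Splitter` are hypothetical stand-ins for the text piece, the render listener, the filtered listener and the splitting wrapper, and the equal character width is an assumption made to keep the arithmetic simple.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "split first, filter afterwards"; none of these types are iText classes.
class SplitterSketch {

    /** A piece of text together with the x-range of its baseline. */
    static final class Chunk {
        final String text;
        final double startX;
        final double endX;

        Chunk(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }

        /** Splits the chunk into one chunk per character (assumes equal widths). */
        List<Chunk> split() {
            List<Chunk> parts = new ArrayList<>();
            int n = text.length();
            if (n == 0) {
                return parts;
            }
            double width = (endX - startX) / n;
            for (int i = 0; i < n; i++) {
                parts.add(new Chunk(text.substring(i, i + 1), startX + i * width, startX + (i + 1) * width));
            }
            return parts;
        }
    }

    interface Listener {
        void renderText(Chunk chunk);
    }

    /** Collects the text of every chunk whose baseline intersects [minX, maxX]. */
    static final class RegionCollector implements Listener {
        final double minX;
        final double maxX;
        final StringBuilder text = new StringBuilder();

        RegionCollector(double minX, double maxX) {
            this.minX = minX;
            this.maxX = maxX;
        }

        public void renderText(Chunk chunk) {
            if (chunk.endX >= minX && chunk.startX <= maxX) {
                text.append(chunk.text);
            }
        }
    }

    /** Forwards every chunk to the delegate one character at a time. */
    static final class Splitter implements Listener {
        final Listener delegate;

        Splitter(Listener delegate) {
            this.delegate = delegate;
        }

        public void renderText(Chunk chunk) {
            for (Chunk part : chunk.split()) {
                delegate.renderText(part);
            }
        }
    }

    public static void main(String[] args) {
        Chunk sentence = new Chunk("was going to eliminate the paper chase", 0, 380);

        RegionCollector whole = new RegionCollector(131, 219);
        whole.renderText(sentence);
        System.out.println(whole.text); // was going to eliminate the paper chase

        RegionCollector perCharacter = new RegionCollector(131, 219);
        new Splitter(perCharacter).renderText(sentence);
        System.out.println(perCharacter.text); // eliminate
    }
}
```

The filter itself is unchanged in both runs; only the granularity of what it is asked about differs.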
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, `TextRenderInfo`, offers a method to split itself up:
Thus, all you have to do is create and use a `RenderListener`/`IRenderListener` implementation which forwards all the calls it gets to another listener (your filtered listener in your case), with the twist that `renderText`/`RenderText` splits its `TextRenderInfo` argument and forwards the splinters one by one.
A Java sample
As the OP asked for more details, here is some more code. As I'm predominantly working with Java, though, I'm providing it in Java for iText. But it is easy to port to C# for iTextSharp.
As mentioned above, a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class `TextRenderInfoSplitter`:
If you have a `TextExtractionStrategy strategy` (like your `new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)`), you now can feed it with single-character `TextRenderInfo` instances like this:
I tested it with the PDF created in this answer for the area
For reference I marked the area in the PDF:
Text extraction filtered by area without the `TextRenderInfoSplitter` results in:
Text extraction filtered by area with the `TextRenderInfoSplitter` results in:
BTW, you here see a disadvantage of splitting the text into individual characters early: The final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies still can easily see that the line consists of the two words "using" and "PDFBox". As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
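A toy model of that effect, again in plain Java: it assumes a strategy that inserts a space whenever the gap between two consecutive pieces exceeds half a space width. That rule, the class name `SpacingSketch` and the coordinates are assumptions chosen for illustration, not a description of what any particular strategy does internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of gap-based word detection; not an iText class.
class SpacingSketch {

    /** A text piece with the x-range it occupies on the line. */
    static final class Piece {
        final String text;
        final double startX;
        final double endX;

        Piece(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Joins pieces left to right, adding a space where the gap exceeds half a space width. */
    static String join(List<Piece> pieces, double spaceWidth) {
        StringBuilder result = new StringBuilder();
        Piece previous = null;
        for (Piece piece : pieces) {
            if (previous != null && piece.startX - previous.endX > spaceWidth / 2) {
                result.append(' ');
            }
            result.append(piece.text);
            previous = piece;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // One segment as found in the content stream: the wide spacing is invisible to join()
        List<Piece> segment = Arrays.asList(new Piece("using", 0, 86));
        System.out.println(join(segment, 5)); // using

        // The same word fed character by character: every gap now looks like a word break
        List<Piece> characters = Arrays.asList(
                new Piece("u", 0, 6), new Piece("s", 20, 26), new Piece("i", 40, 46),
                new Piece("n", 60, 66), new Piece("g", 80, 86));
        System.out.println(join(characters, 5)); // u s i n g
    }
}
```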
An improvement
Something similar happens in my sample above: letters barely touching the area of interest make it into the result.
This is due to the `RegionTextRenderFilter` implementation of `allowText`, which lets through all text whose baseline intersects the rectangle in question, even if the intersection consists of merely a single dot:
Given that you first split the text into characters, you might want to check whether their respective baseline is completely contained in the area in question, i.e. implement your own `RenderFilter` by copying `RegionTextRenderFilter` and then replacing the line that tests for an intersection by one that tests whether the whole baseline lies inside the rectangle.
Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
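The difference between the two checks can be sketched with `java.awt.geom.Rectangle2D` alone. The class `BaselineChecks` and its method names are hypothetical; in a real filter the four coordinates would come from the start and end point of the character's baseline.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical comparison of a lenient and a strict region test; not an iText class.
class BaselineChecks {

    /** Lenient: accepts a baseline that merely touches the area. */
    static boolean intersects(Rectangle2D area, double x1, double y1, double x2, double y2) {
        return area.intersectsLine(x1, y1, x2, y2);
    }

    /**
     * Strict: accepts a baseline only if both end points lie inside the area.
     * Note that Rectangle2D.contains treats the right and top edges as outside.
     */
    static boolean contains(Rectangle2D area, double x1, double y1, double x2, double y2) {
        return area.contains(x1, y1) && area.contains(x2, y2);
    }

    public static void main(String[] args) {
        Rectangle2D area = new Rectangle2D.Double(100, 700, 50, 20);

        // A character whose baseline sticks out of the right edge of the area
        System.out.println(intersects(area, 145, 705, 155, 705)); // true
        System.out.println(contains(area, 145, 705, 155, 705));   // false

        // A character whose baseline lies fully inside the area
        System.out.println(intersects(area, 110, 705, 120, 705)); // true
        System.out.println(contains(area, 110, 705, 120, 705));   // true
    }
}
```

Because a rectangle is convex, checking the two end points is enough to know that the whole baseline is inside.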