iTextSharp works well for extracting plain text from PDF documents, but I'm having trouble with subscript/superscript text, which is common in technical documents.
TextChunk.SameLine() requires two chunks to have identical vertical positioning to be "on" the same line, which isn't the case for superscript or subscript text. For example, on page 11 of this document, under "COMBUSTION EFFICIENCY":
http://www.mass.gov/courts/docs/lawlib/300-399cmr/310cmr7.pdf
Expected text:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO2 /(CO + CO2)]
Result text:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )]
2 2
I moved SameLine() to LocationTextExtractionStrategy and made public getters for the private TextChunk properties it reads. This allowed me to adjust the tolerance on the fly in my own subclass, shown here:
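    using System;
    using iTextSharp.text.pdf.parser;

    // Requires the modified iTextSharp described above: SameLine() moved to
    // LocationTextExtractionStrategy and the TextChunk properties made public.
    public class SubSuperStrategy : LocationTextExtractionStrategy {
        public int SameLineOrientationTolerance { get; set; }
        public int SameLineDistanceTolerance { get; set; }

        public override bool SameLine(TextChunk chunk1, TextChunk chunk2) {
            // Chunks must run in (nearly) the same direction...
            var orientationDelta = Math.Abs(chunk1.OrientationMagnitude
                - chunk2.OrientationMagnitude);
            if (orientationDelta > SameLineOrientationTolerance) return false;
            // ...and sit at (nearly) the same perpendicular distance.
            var distDelta = Math.Abs(chunk1.DistPerpendicular
                - chunk2.DistPerpendicular);
            return distDelta <= SameLineDistanceTolerance;
        }
    }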
Using a SameLineDistanceTolerance of 3, this corrects which line the sub/super chunks are assigned to, but the relative position of the text is way off:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )] 2 2
Sometimes the chunks get inserted somewhere in the middle of the text, and sometimes (as with this example) at the end. Either way, they don't end up in the right place. I suspect this might have something to do with font sizes, but I'm at my limits of understanding the bowels of this code.
Has anyone found another way to deal with this?
(I'm happy to submit a pull request with my changes if that helps.)
I just solved a similar problem, see my question. I detect subscripts as text whose baseline lies between the ascent and descent lines of the preceding text. This snippet of code might be useful:
More details after Christmas.
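A minimal sketch of that rule, with made-up names (`ScriptDetector`, `classify`, `tolerance`) and plain y coordinates instead of iText objects; it is not the snippet referred to above. In iText 5 the three reference values could presumably be taken from `TextRenderInfo.getDescentLine()`, `getBaseline()` and `getAscentLine()`. Treating "lower baseline" as subscript and "higher baseline" as superscript is an assumption added here for illustration.

```java
class ScriptDetector {
    enum Kind { NORMAL, SUBSCRIPT, SUPERSCRIPT, OTHER_LINE }

    /**
     * Classifies a chunk relative to the preceding text. All arguments are
     * y coordinates in PDF user space, where y grows upwards.
     */
    static Kind classify(float prevDescent, float prevBaseline, float prevAscent,
                         float baseline, float tolerance) {
        // A baseline outside the previous text's vertical band is another line.
        if (baseline < prevDescent || baseline > prevAscent) return Kind.OTHER_LINE;
        // (Nearly) identical baselines: ordinary text on the same line.
        if (Math.abs(baseline - prevBaseline) <= tolerance) return Kind.NORMAL;
        // Shifted, but still inside the band: a subscript or a superscript.
        return baseline < prevBaseline ? Kind.SUBSCRIPT : Kind.SUPERSCRIPT;
    }

    public static void main(String[] args) {
        // Preceding text: descent line at 97, baseline at 100, ascent line at 112.
        System.out.println(classify(97f, 100f, 112f, 98f, 0.5f));  // SUBSCRIPT
        System.out.println(classify(97f, 100f, 112f, 106f, 0.5f)); // SUPERSCRIPT
        System.out.println(classify(97f, 100f, 112f, 85f, 0.5f));  // OTHER_LINE
    }
}
```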
To properly extract these subscripts and superscripts in line, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.
I'm more at home in Java/iText; thus, I implemented this approach in Java first and only afterwards translated it to C#/iTextSharp.
An approach using Java & iText
I'm using the current development branch iText 5.5.8-SNAPSHOT.
A way to identify lines
Assuming text lines to be horizontal and the vertical extent of the bounding boxes of the glyphs on different lines not to overlap, one can try to identify lines using a RenderListener like this (TextLineFinder.java):
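As a library-free illustration of the underlying idea only (not the linked TextLineFinder itself), the following sketch merges the vertical extents of text chunks into disjoint intervals, one per line. The class name `LineIntervals` and its methods are hypothetical; in a real RenderListener the `bottom` and `top` values would come from the descent and ascent lines of each rendered text chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Merges vertical chunk extents into sorted, disjoint line intervals. */
class LineIntervals {
    // Each entry is {bottom, top}; entries are sorted by y and never overlap.
    private final List<float[]> lines = new ArrayList<>();

    /** Registers the vertical extent of one text chunk. */
    void add(float bottom, float top) {
        int i = 0;
        // Skip intervals lying entirely below the new extent.
        while (i < lines.size() && lines.get(i)[1] < bottom) i++;
        // Absorb every interval that overlaps the new extent.
        while (i < lines.size() && lines.get(i)[0] <= top) {
            float[] hit = lines.remove(i);
            bottom = Math.min(bottom, hit[0]);
            top = Math.max(top, hit[1]);
        }
        lines.add(i, new float[] {bottom, top});
    }

    List<float[]> getLines() { return lines; }

    public static void main(String[] args) {
        LineIntervals finder = new LineIntervals();
        finder.add(100f, 112f); // "CO" on the formula line
        finder.add(97f, 105f);  // subscript "2", overlapping that line
        finder.add(80f, 92f);   // text on the next line down
        for (float[] line : finder.getLines()) {
            System.out.println(line[0] + " .. " + line[1]);
        }
        // prints "80.0 .. 92.0" and "97.0 .. 112.0"
    }
}
```

The subscript extends the formula line's interval downwards instead of opening a new line, which is exactly what the non-overlap assumption buys.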
The TextLineFinder RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even in case of subscripts and superscripts.
This class essentially is a reduced form of the PageVerticalAnalyzer used in this answer.
Sorting text chunks by those lines
Having identified the lines like above, one can tweak iText's LocationTextExtractionStrategy to sort along those lines like this (HorizontalTextExtractionStrategy.java):
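To show the sort order in isolation, here is a small sketch that does not depend on iText. `LineSortSketch`, `Chunk`, `lineIndex` and `extract` are hypothetical names, the line intervals are hard-coded rather than found by a listener, and chunks are simply concatenated without the word-spacing logic a real strategy would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineSortSketch {
    /** Hypothetical stand-in for a text chunk: text, start x, vertical extent. */
    static final class Chunk {
        final String text;
        final float x, bottom, top;
        Chunk(String text, float x, float bottom, float top) {
            this.text = text; this.x = x; this.bottom = bottom; this.top = top;
        }
    }

    /** Index of the line interval containing the chunk's vertical midpoint, or -1. */
    static int lineIndex(Chunk c, float[][] lines) {
        float mid = (c.bottom + c.top) / 2;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i][0] <= mid && mid <= lines[i][1]) return i;
        }
        return -1;
    }

    /** Orders chunks top line first, then left to right within each line. */
    static String extract(List<Chunk> chunks, float[][] lines) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Intervals are sorted bottom-up, so the negated index puts the top line first.
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -lineIndex(c, lines))
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        int previous = Integer.MIN_VALUE;
        for (Chunk c : sorted) {
            int line = lineIndex(c, lines);
            if (sb.length() > 0 && line != previous) sb.append('\n');
            sb.append(c.text);
            previous = line;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float[][] lines = { {80f, 92f}, {97f, 112f} };
        List<Chunk> chunks = Arrays.asList(
                new Chunk("2", 50f, 97f, 105f),       // second subscript
                new Chunk("next", 10f, 80f, 92f),     // following line
                new Chunk("CO", 10f, 100f, 112f),
                new Chunk(" + CO", 28f, 100f, 112f),
                new Chunk("2", 22f, 97f, 105f));      // first subscript
        System.out.println(extract(chunks, lines));
        // prints "CO2 + CO2" followed by "next" on a second line
    }
}
```

Because the sort key is the line index rather than the chunk's own baseline, the subscripts sort by x among their neighbours instead of trailing the line.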
The HorizontalTextExtractionStrategy is a TextExtractionStrategy that uses a TextLineFinder to identify horizontal text lines and then uses this information to sort the text chunks.
Beware, this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy the LocationTextExtractionStrategy and directly insert the code.
Extracting the text
Now one can use this text extraction strategy to extract the text with inline superscripts and subscripts like this:
(from ExtractSuperAndSubInLine.java)
The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", now is extracted like this:
The same approach using C# & iTextSharp
Explanations, warnings, and sample results from the Java-centric section still apply; here is the code:
I'm using iTextSharp 5.5.7.
A way to identify lines
Sorting text chunks by those lines
Extracting the text
UPDATE: Changes in LocationTextExtractionStrategy
In the iText 5.5.9-SNAPSHOT commits 53526e4854fcb80c86cbc2e113f7a07401dc9a67 ("Refactor LocationTextExtractionStrategy...") through 1ab350beae148be2a4bef5e663b3d67a004ff9f8 ("Make TextChunkLocation a Comparable<> class..."), the LocationTextExtractionStrategy architecture has been changed to allow for customizations like this without the need for reflection.
Unfortunately this change breaks the HorizontalTextExtractionStrategy presented above. For iText versions after those commits one can use the following strategy:
(HorizontalTextExtractionStrategy2.java)