iTextSharp works well for extracting plain text from PDF documents, but I'm having trouble with subscript/superscript text, which is common in technical documents.
TextChunk.SameLine() requires two chunks to have identical vertical positioning to be "on" the same line, which isn't the case for superscript or subscript text. For example, on page 11 of this document, under "COMBUSTION EFFICIENCY":
http://www.mass.gov/courts/docs/lawlib/300-399cmr/310cmr7.pdf
Expected text:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO2 /(CO + CO2)]
Result text:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )]
2 2
I moved SameLine() to LocationTextExtractionStrategy and made public getters for the private TextChunk properties it reads. This allowed me to adjust the tolerance on the fly in my own subclass, shown here:
public class SubSuperStrategy : LocationTextExtractionStrategy {
public int SameLineOrientationTolerance { get; set; }
public int SameLineDistanceTolerance { get; set; }
public override bool SameLine(TextChunk chunk1, TextChunk chunk2) {
var orientationDelta = Math.Abs(chunk1.OrientationMagnitude
- chunk2.OrientationMagnitude);
if(orientationDelta > SameLineOrientationTolerance) return false;
var distDelta = Math.Abs(chunk1.DistPerpendicular
- chunk2.DistPerpendicular);
return (distDelta <= SameLineDistanceTolerance);
}
}
Using a SameLineDistanceTolerance of 3, this corrects which line the sub/super chunks are assigned to, but the relative position of the text is way off:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )] 2 2
Sometimes the chunks get inserted somewhere in the middle of the text, and sometimes (as with this example) at the end. Either way, they don't end up in the right place. I suspect this might have something to do with font sizes, but I've reached the limit of my understanding of the internals of this code.
Has anyone found another way to deal with this?
(I'm happy to submit a pull request with my changes if that helps.)
To properly extract these subscripts and superscripts in line, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.
I'm more at home in Java/iText; thus, I implemented this approach in Java first and only afterwards translated it to C#/iTextSharp.
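Before the actual approach, here is a minimal, iText-free sketch of why widening the SameLine tolerance alone leaves the subscripts at the end of the line. It assumes what the stock comparison does for horizontal text: chunks are sorted by the integer perpendicular distance first (which reduces to -y) and by horizontal position second, and only afterwards is SameLine consulted to decide about line breaks. The class SortOrderDemo, its Chunk type, and the coordinates are made up for illustration, and spacing between chunks is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortOrderDemo {
    // Hypothetical stand-in for a text chunk: text plus baseline start coordinates.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        // For horizontal text the perpendicular distance reduces to -y, truncated to int.
        int distPerpendicular() {
            return (int) (-y);
        }
    }

    // Sorts by perpendicular distance first, then by x, and concatenates the texts.
    static String joinSorted(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<Chunk>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.distPerpendicular())
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("[CO", 10f, 100f),
                new Chunk("2", 28f, 98f),          // subscript, baseline 2 units lower
                new Chunk("/(CO + CO", 34f, 100f),
                new Chunk("2", 80f, 98f),          // subscript
                new Chunk(")]", 86f, 100f));
        // The subscripts sort after every chunk of the main baseline.
        System.out.println(joinSorted(chunks));
    }
}
```

With these made-up coordinates the result is "[CO/(CO + CO)]22": a tolerant SameLine only suppresses the line break, the sort order has already pushed both subscripts behind the rest of the line. Hence the need to change the comparison as well as the same-line check.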
An approach using Java & iText
I'm using the current development branch iText 5.5.8-SNAPSHOT.
A way to identify lines
Assuming text lines to be horizontal and the vertical extent of the bounding boxes of the glyphs on different lines not to overlap, one can try to identify lines using a RenderListener like this:
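import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;
public class TextLineFinder implements RenderListener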
{
@Override
public void beginTextBlock() { }
@Override
public void endTextBlock() { }
@Override
public void renderImage(ImageRenderInfo renderInfo) { }
/*
* @see RenderListener#renderText(TextRenderInfo)
*/
@Override
public void renderText(TextRenderInfo renderInfo)
{
LineSegment ascentLine = renderInfo.getAscentLine();
LineSegment descentLine = renderInfo.getDescentLine();
float[] yCoords = new float[]{
ascentLine.getStartPoint().get(Vector.I2),
ascentLine.getEndPoint().get(Vector.I2),
descentLine.getStartPoint().get(Vector.I2),
descentLine.getEndPoint().get(Vector.I2)
};
Arrays.sort(yCoords);
addVerticalUseSection(yCoords[0], yCoords[3]);
}
/**
* This method marks the given interval as used.
*/
void addVerticalUseSection(float from, float to)
{
if (to < from)
{
float temp = to;
to = from;
from = temp;
}
int i=0, j=0;
for (; i<verticalFlips.size(); i++)
{
float flip = verticalFlips.get(i);
if (flip < from)
continue;
for (j=i; j<verticalFlips.size(); j++)
{
flip = verticalFlips.get(j);
if (flip < to)
continue;
break;
}
break;
}
boolean fromOutsideInterval = i%2==0;
boolean toOutsideInterval = j%2==0;
while (j-- > i)
verticalFlips.remove(j);
if (toOutsideInterval)
verticalFlips.add(i, to);
if (fromOutsideInterval)
verticalFlips.add(i, from);
}
final List<Float> verticalFlips = new ArrayList<Float>();
}
(TextLineFinder.java)
This RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even in case of subscripts and superscripts.
This class is essentially a reduced form of the PageVerticalAnalyzer used in this answer.
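To make the contents of verticalFlips more tangible, here is a small stand-alone sketch that runs the same interval bookkeeping as addVerticalUseSection without any iText dependency. The class name FlipsDemo and the y values are made up for illustration; the method body mirrors the listing above.

```java
import java.util.ArrayList;
import java.util.List;

public class FlipsDemo {
    // Sorted y boundaries: even indices open a used interval, odd indices close it.
    final List<Float> verticalFlips = new ArrayList<Float>();

    // Marks [from, to] as used, merging it with any intervals it touches.
    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }
        int i = 0, j = 0;
        for (; i < verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;
            for (j = i; j < verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i % 2 == 0;
        boolean toOutsideInterval = j % 2 == 0;
        while (j-- > i)
            verticalFlips.remove(j); // int argument, so this removes by index
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    public static void main(String[] args) {
        FlipsDemo demo = new FlipsDemo();
        demo.addVerticalUseSection(100f, 112f); // body text of the upper line
        demo.addVerticalUseSection(97f, 104f);  // subscript overlapping that line
        demo.addVerticalUseSection(80f, 92f);   // body text of the next line down
        System.out.println(demo.verticalFlips); // [80.0, 92.0, 97.0, 112.0]
    }
}
```

The subscript extends the upper band downwards from 100 to 97 instead of creating a band of its own, which is exactly what lets the subscript be assigned to the line it belongs to.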
Sorting text chunks by those lines
Having identified the lines like above, one can tweak iText's LocationTextExtractionStrategy to sort along those lines like this:
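import java.lang.reflect.Field;
import java.util.List;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;
public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy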
{
public class HorizontalTextChunk extends TextChunk
{
public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth)
{
super(string, startLocation, endLocation, charSpaceWidth);
}
@Override
public int compareTo(TextChunk rhs)
{
if (rhs instanceof HorizontalTextChunk)
{
HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
if (rslt != 0) return rslt;
return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
}
else
return super.compareTo(rhs);
}
@Override
public boolean sameLine(TextChunk as)
{
if (as instanceof HorizontalTextChunk)
{
HorizontalTextChunk horAs = (HorizontalTextChunk) as;
return getLineNumber() == horAs.getLineNumber();
}
else
return super.sameLine(as);
}
public int getLineNumber()
{
Vector startLocation = getStartLocation();
float y = startLocation.get(Vector.I2);
List<Float> flips = textLineFinder.verticalFlips;
if (flips == null || flips.isEmpty())
return 0;
if (y < flips.get(0))
return flips.size() / 2 + 1;
for (int i = 1; i < flips.size(); i+=2)
{
if (y < flips.get(i))
{
return (1 + flips.size() - i) / 2;
}
}
return 0;
}
}
@Override
public void renderText(TextRenderInfo renderInfo)
{
textLineFinder.renderText(renderInfo);
LineSegment segment = renderInfo.getBaseline();
// Remove the rise from the baseline: text from a super/subscript render operation
// should probably be considered as part of the baseline of the text it is relative to.
if (renderInfo.getRise() != 0)
{
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
segment = segment.transformBy(riseOffsetTransform);
}
TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth());
getLocationalResult().add(location);
}
public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException
{
locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
locationalResultField.setAccessible(true);
textLineFinder = new TextLineFinder();
}
@SuppressWarnings("unchecked")
List<TextChunk> getLocationalResult()
{
try
{
return (List<TextChunk>) locationalResultField.get(this);
}
catch (IllegalArgumentException | IllegalAccessException e)
{
e.printStackTrace();
throw new RuntimeException(e);
}
}
final Field locationalResultField;
final TextLineFinder textLineFinder;
}
(HorizontalTextExtractionStrategy.java)
This TextExtractionStrategy uses a TextLineFinder to identify horizontal text lines and then uses this information to sort the text chunks.
Beware, this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy the LocationTextExtractionStrategy and directly insert the code.
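The line lookup and the resulting sort order can be tried out without iText. The following is only a sketch under stated assumptions: LineOrderDemo, its Chunk type, and all coordinates are made up, the flips list is given by hand instead of being collected by a TextLineFinder, and spacing is ignored. The lineNumber method mirrors getLineNumber() above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LineOrderDemo {
    // Hypothetical stand-in for a text chunk: text plus baseline start coordinates.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Same lookup as getLineNumber(): bands are numbered from the top of the page down.
    static int lineNumber(List<Float> flips, float y) {
        if (flips == null || flips.isEmpty())
            return 0;
        if (y < flips.get(0))
            return flips.size() / 2 + 1;
        for (int i = 1; i < flips.size(); i += 2) {
            if (y < flips.get(i))
                return (1 + flips.size() - i) / 2;
        }
        return 0;
    }

    // Sorts by line number, then by x, and starts a new output line per text line.
    static String order(List<Float> flips, List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<Chunk>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> lineNumber(flips, c.y))
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        int previousLine = -1;
        for (Chunk c : sorted) {
            int line = lineNumber(flips, c.y);
            if (previousLine != -1 && line != previousLine)
                sb.append('\n');
            sb.append(c.text);
            previousLine = line;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Float> flips = Arrays.asList(80f, 92f, 97f, 112f); // two bands
        List<Chunk> chunks = Arrays.asList(
                new Chunk("where", 10f, 83f),
                new Chunk("2", 80f, 98f),          // subscript
                new Chunk(")]", 86f, 100f),
                new Chunk("[CO", 10f, 100f),
                new Chunk("2", 28f, 98f),          // subscript
                new Chunk("/(CO + CO", 34f, 100f));
        System.out.println(order(flips, chunks));
    }
}
```

Because both subscripts fall into the band from 97 to 112, they get the same line number as the surrounding text and are then placed by their x coordinate, so this prints "[CO2/(CO + CO2)]" followed by "where" on the next line.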
Extracting the text
Now one can use this text extraction strategy to extract the text with inline superscripts and subscripts like this:
String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
}
(from ExtractSuperAndSubInLine.java)
The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", now is extracted like this:
monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO 2/(CO + CO 2 )]
The same approach using C# & iTextSharp
The explanations, warnings, and sample results from the Java-centric section still apply; here is the code.
I'm using iTextSharp 5.5.7.
A way to identify lines
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
public class TextLineFinder : IRenderListener
{
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
public void RenderText(TextRenderInfo renderInfo)
{
LineSegment ascentLine = renderInfo.GetAscentLine();
LineSegment descentLine = renderInfo.GetDescentLine();
float[] yCoords = new float[]{
ascentLine.GetStartPoint()[Vector.I2],
ascentLine.GetEndPoint()[Vector.I2],
descentLine.GetStartPoint()[Vector.I2],
descentLine.GetEndPoint()[Vector.I2]
};
Array.Sort(yCoords);
addVerticalUseSection(yCoords[0], yCoords[3]);
}
void addVerticalUseSection(float from, float to)
{
if (to < from)
{
float temp = to;
to = from;
from = temp;
}
int i=0, j=0;
for (; i<verticalFlips.Count; i++)
{
float flip = verticalFlips[i];
if (flip < from)
continue;
for (j=i; j<verticalFlips.Count; j++)
{
flip = verticalFlips[j];
if (flip < to)
continue;
break;
}
break;
}
bool fromOutsideInterval = i%2==0;
bool toOutsideInterval = j%2==0;
while (j-- > i)
verticalFlips.RemoveAt(j);
if (toOutsideInterval)
verticalFlips.Insert(i, to);
if (fromOutsideInterval)
verticalFlips.Insert(i, from);
}
public List<float> verticalFlips = new List<float>();
}
Sorting text chunks by those lines
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy
{
public class HorizontalTextChunk : TextChunk
{
public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder)
: base(stringValue, startLocation, endLocation, charSpaceWidth)
{
this.textLineFinder = textLineFinder;
}
override public int CompareTo(TextChunk rhs)
{
if (rhs is HorizontalTextChunk)
{
HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber());
if (rslt != 0) return rslt;
return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]);
}
else
return base.CompareTo(rhs);
}
public override bool SameLine(TextChunk a)
{
if (a is HorizontalTextChunk)
{
HorizontalTextChunk horAs = (HorizontalTextChunk) a;
return getLineNumber() == horAs.getLineNumber();
}
else
return base.SameLine(a);
}
public int getLineNumber()
{
Vector startLocation = StartLocation;
float y = startLocation[Vector.I2];
List<float> flips = textLineFinder.verticalFlips;
if (flips == null || flips.Count == 0)
return 0;
if (y < flips[0])
return flips.Count / 2 + 1;
for (int i = 1; i < flips.Count; i+=2)
{
if (y < flips[i])
{
return (1 + flips.Count - i) / 2;
}
}
return 0;
}
private static int CompareInts(int int1, int int2){
return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
}
private static int CompareFloats(float float1, float float2)
{
return float1 == float2 ? 0 : float1 < float2 ? -1 : 1;
}
TextLineFinder textLineFinder;
}
public override void RenderText(TextRenderInfo renderInfo)
{
textLineFinder.RenderText(renderInfo);
LineSegment segment = renderInfo.GetBaseline();
// Remove the rise from the baseline: text from a super/subscript render operation
// should probably be considered as part of the baseline of the text it is relative to.
if (renderInfo.GetRise() != 0)
{
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
segment = segment.TransformBy(riseOffsetTransform);
}
TextChunk location = new HorizontalTextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), textLineFinder);
getLocationalResult().Add(location);
}
public HorizontalTextExtractionStrategy()
{
locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
textLineFinder = new TextLineFinder();
}
List<TextChunk> getLocationalResult()
{
return (List<TextChunk>) locationalResultField.GetValue(this);
}
System.Reflection.FieldInfo locationalResultField;
TextLineFinder textLineFinder;
}
Extracting the text
string extract(PdfReader reader, int pageNo)
{
return PdfTextExtractor.GetTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
}
UPDATE: Changes in LocationTextExtractionStrategy
In the iText 5.5.9-SNAPSHOT commits 53526e4854fcb80c86cbc2e113f7a07401dc9a67 ("Refactor LocationTextExtractionStrategy...") through 1ab350beae148be2a4bef5e663b3d67a004ff9f8 ("Make TextChunkLocation a Comparable<> class..."), the LocationTextExtractionStrategy architecture has been changed to allow for customizations like this without the need for reflection.
Unfortunately, this change breaks the HorizontalTextExtractionStrategy presented above. For iText versions after those commits, one can use the following strategy:
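import java.util.List;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;
public class HorizontalTextExtractionStrategy2 extends LocationTextExtractionStrategy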
{
public static class HorizontalTextChunkLocationStrategy implements TextChunkLocationStrategy
{
public HorizontalTextChunkLocationStrategy(TextLineFinder textLineFinder)
{
this.textLineFinder = textLineFinder;
}
@Override
public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline)
{
return new HorizontalTextChunkLocation(baseline.getStartPoint(), baseline.getEndPoint(), renderInfo.getSingleSpaceWidth());
}
final TextLineFinder textLineFinder;
public class HorizontalTextChunkLocation implements TextChunkLocation
{
/** the starting location of the chunk */
private final Vector startLocation;
/** the ending location of the chunk */
private final Vector endLocation;
/** unit vector in the orientation of the chunk */
private final Vector orientationVector;
/** the orientation as a scalar for quick sorting */
private final int orientationMagnitude;
/** perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
* we round to the nearest integer to handle the fuzziness of comparing floats */
private final int distPerpendicular;
/** distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
private final float distParallelStart;
/** distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
private final float distParallelEnd;
/** the width of a single space character in the font of the chunk */
private final float charSpaceWidth;
public HorizontalTextChunkLocation(Vector startLocation, Vector endLocation, float charSpaceWidth)
{
this.startLocation = startLocation;
this.endLocation = endLocation;
this.charSpaceWidth = charSpaceWidth;
Vector oVector = endLocation.subtract(startLocation);
if (oVector.length() == 0)
{
oVector = new Vector(1, 0, 0);
}
orientationVector = oVector.normalize();
orientationMagnitude = (int)(Math.atan2(orientationVector.get(Vector.I2), orientationVector.get(Vector.I1))*1000);
// see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
// the two vectors we are crossing are in the same plane, so the result will be purely
// in the z-axis (out of plane) direction, so we just take the I3 component of the result
Vector origin = new Vector(0,0,1);
distPerpendicular = (int)(startLocation.subtract(origin)).cross(orientationVector).get(Vector.I3);
distParallelStart = orientationVector.dot(startLocation);
distParallelEnd = orientationVector.dot(endLocation);
}
public int orientationMagnitude() { return orientationMagnitude; }
public int distPerpendicular() { return distPerpendicular; }
public float distParallelStart() { return distParallelStart; }
public float distParallelEnd() { return distParallelEnd; }
public Vector getStartLocation() { return startLocation; }
public Vector getEndLocation() { return endLocation; }
public float getCharSpaceWidth() { return charSpaceWidth; }
/**
* @param as the location to compare to
* @return true if this location is on the same line as the other
*/
public boolean sameLine(TextChunkLocation as)
{
if (as instanceof HorizontalTextChunkLocation)
{
HorizontalTextChunkLocation horAs = (HorizontalTextChunkLocation) as;
return getLineNumber() == horAs.getLineNumber();
}
else
return orientationMagnitude() == as.orientationMagnitude() && distPerpendicular() == as.distPerpendicular();
}
/**
* Computes the distance between the end of 'other' and the beginning of this chunk
* in the direction of this chunk's orientation vector. Note that it's a bad idea
* to call this for chunks that aren't on the same line and orientation, but we don't
* explicitly check for that condition for performance reasons.
* @param other
* @return the number of spaces between the end of 'other' and the beginning of this chunk
*/
public float distanceFromEndOf(TextChunkLocation other)
{
float distance = distParallelStart() - other.distParallelEnd();
return distance;
}
public boolean isAtWordBoundary(TextChunkLocation previous)
{
/**
* Here we handle a very specific case which in PDF may look like:
* -.232 Tc [( P)-226.2(r)-231.8(e)-230.8(f)-238(a)-238.9(c)-228.9(e)]TJ
* The font's charSpace width is 0.232 and it's compensated with charSpacing of 0.232.
* And a resultant TextChunk.charSpaceWidth comes to TextChunk constructor as 0.
* In this case every chunk is considered as a word boundary and space is added.
* We should consider charSpaceWidth equal (or close) to zero as a no-space.
*/
if (getCharSpaceWidth() < 0.1f)
return false;
float dist = distanceFromEndOf(previous);
return dist < -getCharSpaceWidth() || dist > getCharSpaceWidth()/2.0f;
}
public int getLineNumber()
{
Vector startLocation = getStartLocation();
float y = startLocation.get(Vector.I2);
List<Float> flips = textLineFinder.verticalFlips;
if (flips == null || flips.isEmpty())
return 0;
if (y < flips.get(0))
return flips.size() / 2 + 1;
for (int i = 1; i < flips.size(); i+=2)
{
if (y < flips.get(i))
{
return (1 + flips.size() - i) / 2;
}
}
return 0;
}
@Override
public int compareTo(TextChunkLocation rhs)
{
if (rhs instanceof HorizontalTextChunkLocation)
{
HorizontalTextChunkLocation horRhs = (HorizontalTextChunkLocation) rhs;
int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
if (rslt != 0) return rslt;
return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
}
else
{
int rslt;
rslt = Integer.compare(orientationMagnitude(), rhs.orientationMagnitude());
if (rslt != 0) return rslt;
rslt = Integer.compare(distPerpendicular(), rhs.distPerpendicular());
if (rslt != 0) return rslt;
return Float.compare(distParallelStart(), rhs.distParallelStart());
}
}
}
}
@Override
public void renderText(TextRenderInfo renderInfo)
{
textLineFinder.renderText(renderInfo);
super.renderText(renderInfo);
}
public HorizontalTextExtractionStrategy2()
{
this(new TextLineFinder());
}
public HorizontalTextExtractionStrategy2(TextLineFinder textLineFinder)
{
super(new HorizontalTextChunkLocationStrategy(textLineFinder));
this.textLineFinder = textLineFinder;
}
final TextLineFinder textLineFinder;
}
(HorizontalTextExtractionStrategy2.java)
I just solved a similar problem, see my question. I detect subscripts as text that has a baseline between the ascent and descent lines of the preceding text. This snippet of code might be useful:
Vector thisFacade = this.ascentLine.GetStartPoint().Subtract(this.descentLine.GetStartPoint());
Vector infoFacade = renderInfo.GetAscentLine().GetStartPoint().Subtract(renderInfo.GetDescentLine().GetStartPoint());
if (baseVector.Cross(ascent2base).Dot(baseVector.Cross(descent2base)) < 0
&& infoFacade.LengthSquared < thisFacade.LengthSquared - sameHeightThreshols)
More details after Christmas.
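The snippet above relies on fields that are not shown (baseVector, ascent2base, descent2base, and the threshold), so it cannot be run as is. As a rough illustration of the idea only, here is a simplified Java sketch for purely horizontal text. Everything in it is an assumption: the class SubscriptCheckDemo, its Box type, and the method name are hypothetical, plain y coordinates replace the cross-product test, and plain heights replace the squared lengths.

```java
public class SubscriptCheckDemo {
    // Hypothetical vertical metrics of one chunk of horizontal text.
    static final class Box {
        final float descentY;
        final float baselineY;
        final float ascentY;

        Box(float descentY, float baselineY, float ascentY) {
            this.descentY = descentY;
            this.baselineY = baselineY;
            this.ascentY = ascentY;
        }

        float height() {
            return ascentY - descentY;
        }
    }

    // True if the baseline of 'next' lies strictly between the descent and ascent
    // lines of 'previous' and 'next' is shorter by more than the threshold.
    static boolean looksLikeSubscript(Box previous, Box next, float sameHeightThreshold) {
        boolean baselineInside = next.baselineY > previous.descentY
                && next.baselineY < previous.ascentY;
        boolean smaller = next.height() < previous.height() - sameHeightThreshold;
        return baselineInside && smaller;
    }

    public static void main(String[] args) {
        Box body = new Box(97f, 100f, 110f);       // regular text
        Box sub = new Box(96.2f, 98f, 103f);       // smaller, shifted text
        Box sameSize = new Box(97f, 100f, 110f);   // next word on the same baseline
        Box nextLine = new Box(83f, 86f, 96f);     // text on the following line
        System.out.println(looksLikeSubscript(body, sub, 1f));      // true
        System.out.println(looksLikeSubscript(body, sameSize, 1f)); // false
        System.out.println(looksLikeSubscript(body, nextLine, 1f)); // false
    }
}
```

In this reduced form the check cannot tell a subscript from a superscript or from smaller text set on the same baseline; it only shows how the two conditions combine.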