When I parse an existing PDF using iText(Sharp), I create an object which implements IRenderListener which I pass into PdfReaderContentParser.ProcessContent() and sure enough, my object's RenderText() gets called repeatedly with all the text in the PDF.
The problem is, the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold). Is this a known deficiency of iText(Sharp) or am I missing something?
the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold)
Height
Unfortunately iTextSharp does not provide a public font size method or member in the TextRenderInfo. Some people worked around this by using the distance between its GetAscentLine() and its GetDescentLine().
If you are ready to use Reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. like in this render listener:
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf.parser;

public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
{
    // Reflection handle for the private TextRenderInfo field "gs"
    private static readonly FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", BindingFlags.NonPublic | BindingFlags.Instance);

    // Holds the font size, text, and font name of each text chunk
    public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();

    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
        myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }
}
// Helper class that stores the font size, text, and font name of a chunk
public class SizeAndTextAndFont
{
    public float Size;
    public String Text;
    public String Font;

    public SizeAndTextAndFont(float size, String text, String font)
    {
        this.Size = size;
        this.Text = text;
        this.Font = font;
    }
}
You can extract information with such a render listener like this:
using (var pdfReader = new PdfReader(testFile))
{
    // Loop through the pages of the document (page numbers are 1-based; endPage is exclusive here)
    for (var page = startPage; page < endPage; page++)
    {
        Console.WriteLine("\n Page {0}", page);

        LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
        // The returned text is ignored; the call is only needed to feed the strategy
        PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        foreach (SizeAndTextAndFont p in strategy.myChunks)
        {
            Console.WriteLine("<{0}> in {2} at {1}", p.Text, p.Size, p.Font);
        }
    }
}
This produces an output like this:
Page 1
< The Philippine Stock Exchange, Inc> in Helvetica-Bold at 8
< Daily Quotations Report> in Helvetica-Bold at 8
< March 23 , 2015> in Helvetica-Bold at 8
<Name> in Helvetica at 7
<Symbol> in Helvetica at 7
<Bid> in Helvetica at 7
[...]
Considering transformations
The numbers you see in the output as font sizes are the values of the font size property in the PDF graphics state at the time the respective text is drawn.
Due to the flexibility of PDF, though, this may not be the font size you eventually see in the output: a custom transformation may stretch the output considerably. Some PDF producers even always use a font size of 1 and rely on transformations to scale the output accordingly.
To get a good value for font sizes in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this:
public override void RenderText(TextRenderInfo wholeRenderInfo)
{
    base.RenderText(wholeRenderInfo);
    GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
    Matrix textToUserSpaceTransformMatrix = (Matrix) TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
    float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;
    myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
}
with this additional reflection FieldInfo member:
FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
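To see why transforming the vector (0, FontSize, 0) gives a usable size, the arithmetic can be sketched without iText at all. The following is only an illustration: the class and method names (EffectiveFontSizeDemo, EffectiveSize) are made up for this example, and it assumes the usual PDF matrix layout [a b 0; c d 0; e f 1], passed as the six values a to f, with the row vector multiplied from the left.

```csharp
using System;

public static class EffectiveFontSizeDemo
{
    // m holds the six PDF matrix values { a, b, c, d, e, f }.
    // (0, fontSize, 0) times [a b 0; c d 0; e f 1] is (fontSize*c, fontSize*d, 0);
    // because the third vector component is 0, the translation (e, f) drops out.
    public static float EffectiveSize(float fontSize, float[] m)
    {
        float x = fontSize * m[2];
        float y = fontSize * m[3];
        return (float) Math.Sqrt(x * x + y * y);
    }

    public static void Main()
    {
        // Font size 1 combined with a uniform scaling by 12
        Console.WriteLine(EffectiveSize(1f, new float[] { 12f, 0f, 0f, 12f, 50f, 700f })); // prints 12

        // Font size 8 with an identity matrix stays 8
        Console.WriteLine(EffectiveSize(8f, new float[] { 1f, 0f, 0f, 1f, 0f, 0f })); // prints 8
    }
}
```

A document that always uses a font size of 1 and scales by 12 therefore reports 12 rather than 1, which is usually much closer to what a viewer displays.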
Weight
As you can see in the output above, the font name may contain not only the font family name but also a weight indicator:
< March 23 , 2015> in Helvetica-Bold at 8
In your example ("the TextRenderInfo tells me about the base font (in my case, Helvetica)"), therefore, the plain Helvetica without any decorations would imply a regular weight.
Helvetica is one of the standard 14 fonts which every PDF viewer must provide out-of-the-box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are pretty dependable.
Unfortunately font names in general may be chosen arbitrarily; a bold font may have "Bold" or "Black" or other indicators of boldness in its name or none at all.
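A name-based check along these lines could be sketched as follows. This is a heuristic only, not something iTextSharp provides: the class name FontWeightHeuristic, the method LooksBold, and the marker list are assumptions made for this example. It also strips the six-letter subset prefix (e.g. "ABCDEF+") that embedded subset fonts carry in their names.

```csharp
using System;

public static class FontWeightHeuristic
{
    // Assumed list of name fragments that often signal a heavy weight;
    // it is neither complete nor free of false positives.
    private static readonly string[] BoldMarkers = { "bold", "black", "heavy" };

    public static bool LooksBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
            return false;

        string name = postscriptFontName;

        // Subset fonts are named like "ABCDEF+Helvetica-Bold": drop the tag
        if (name.Length > 7 && name[6] == '+')
            name = name.Substring(7);

        name = name.ToLowerInvariant();
        foreach (string marker in BoldMarkers)
        {
            if (name.Contains(marker))
                return true;
        }
        return false;
    }
}
```

With the render listener above, one might call FontWeightHeuristic.LooksBold(p.Font) for each chunk. For the standard 14 fonts this should be reliable; for arbitrary fonts it can miss bold fonts whose names carry no marker and can misfire on family names that happen to contain one.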
One might also try to use the font's FontDescriptor dictionary, for which an entry FontWeight is specified. Unfortunately this entry is optional, so you cannot count on it being there at all.
Furthermore, a font in a PDF can be artificially emboldened, cf. this answer:
All these numbers are drawn using the same font, merely adding a rising outline line width.
Thus, I'm afraid there is no dependable way to find the exact font weight, merely a number of heuristics which may or may not return acceptable approximations.