Because of some semantic analysis work, I need identify paragraphs from pdf files with iTextSharp. I know the coordinates of iTextSharp live in the left bottom corner of a page. I find three features to define the paragraph boundaries:
- if the horizontal axis of the first word in one line is less than that of the general lines;
- if the leading of two consecutive lines is larger than that of the general ones;
- if one line ends with "." and the horizontal axis of the ending word is less than that of the other lines
However, I am stuck on the second one. How can I know the general leading between two lines in a paragraph? I mean there are different gaps between two consecutive lines, because some letters like 'f','g' need more space than the others like 'a','n' and so on.
Thanks for your help!
I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance Extract font height and rotation from PDF files with iText/iTextSharp to see how others have done this before you. A more elaborate article can be found here: Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare
Your question is: how can I calculate the leading? That is: how do I know the distance between the base lines of two consecutive lines?
When you parse a PDF using iTextSharp, you see each line as a series of
TextRenderInfo
object. These objects allow you to get the base line of the text:This
Vector
consists of different elements: Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in ItextsharpYou need
startpoint[Vector.I2]
. See also: How to detect newline from PDF using iTextSharpThe difference between that value for two consecutive lines give you the value of the leading in its modern meaning. In the old times of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you could work with the font size. The font size is an average size of the glyphs in a font. Some glyphs will take more space in the height, some will take less, but taking both the leading (distance between baselines) and the font size (average height of each glyph) into account, you can get a fair idea of the "space between the lines".