Identify paragraphs of PDF files using iTextSharp

Posted 2019-08-14 18:52

For some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of the coordinate system is at the bottom-left corner of the page. I have found three features that could define paragraph boundaries:

  1. the x-coordinate of the first word of a line differs from that of the ordinary lines (for example, an indented first line);
  2. the leading between two consecutive lines is larger than that between ordinary lines;
  3. a line ends with "." and the x-coordinate of its last word is less than that of the other lines.

However, I am stuck on the second one. How can I determine the general leading between two lines within a paragraph? The gaps between consecutive lines seem to vary, because some letters such as 'f' and 'g' need more vertical space than others such as 'a' and 'n'.

Thanks for your help!

1 answer
Explosion°爆炸
#2 · 2019-08-14 19:42

I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See, for instance, "Extract font height and rotation from PDF files with iText/iTextSharp" to see how others have done this before you. A more elaborate article can be found here: "Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare".

Your question is: how can I calculate the leading? That is: how do I know the distance between the base lines of two consecutive lines?

When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the baseline of the text:

LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();

This Vector consists of different elements; see "Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp".

You need startpoint[Vector.I2], which is the y-coordinate of the baseline. See also: "How to detect newline from PDF using iTextSharp".
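As a rough sketch of what to do with those values: a line is made up of several TextRenderInfo chunks, and chunks on the same line share (almost) the same baseline y. Assuming you have already collected startpoint[Vector.I2] for every chunk on a page into a list of floats, the helper below collapses them into one baseline per line. The names BaselineGrouper and DistinctBaselines and the 0.5 point tolerance are hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BaselineGrouper
{
    // Collapses per-chunk baseline y values into one value per text line.
    // Assumes default PDF user space: y grows upward, so the top line has the largest y.
    // The tolerance (in points) is an illustrative guess, not a library constant.
    public static List<float> DistinctBaselines(IEnumerable<float> chunkYs, float tolerance = 0.5f)
    {
        var lines = new List<float>();
        foreach (float y in chunkYs.OrderByDescending(v => v))
        {
            // Start a new line only when y differs noticeably from the last line kept.
            if (lines.Count == 0 || Math.Abs(lines[lines.Count - 1] - y) > tolerance)
            {
                lines.Add(y);
            }
        }
        return lines;
    }
}
```

This simple grouping treats superscripts, subscripts and side-by-side columns as separate or merged lines, so real documents will probably need extra checks on the x-coordinate as well.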

The difference between that value for two consecutive lines gives you the value of the leading in its modern meaning.

In the old times of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you could work with the font size. The font size is an average size of the glyphs in a font. Some glyphs will take more vertical space, some will take less, but by taking both the leading (the distance between baselines) and the font size (the average height of each glyph) into account, you can get a fair idea of the "space between the lines".
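Because the baseline does not move with ascenders or descenders, letters such as 'f' or 'g' do not change the distance between baselines. One possible way to get a "general" leading is to take the median of all baseline distances on a page and flag gaps that are clearly larger. The sketch below assumes you already have one baseline y per line, sorted from top to bottom. The names LeadingAnalyzer, Leadings, Median and ParagraphBreaks and the factor 1.3 are hypothetical choices for illustration, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeadingAnalyzer
{
    // Distances between consecutive baselines, given y values sorted top to bottom.
    public static List<float> Leadings(IList<float> baselines)
    {
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
        {
            gaps.Add(baselines[i - 1] - baselines[i]);
        }
        return gaps;
    }

    // Median used as the "general" leading: a few large paragraph gaps do not shift it much.
    public static float Median(IEnumerable<float> values)
    {
        var sorted = values.OrderBy(v => v).ToList();
        if (sorted.Count == 0) throw new ArgumentException("No values.", nameof(values));
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2f;
    }

    // Returns each index i where the gap between line i and line i + 1
    // exceeds factor * median leading (factor is an assumed threshold).
    public static List<int> ParagraphBreaks(IList<float> baselines, float factor = 1.3f)
    {
        var gaps = Leadings(baselines);
        var breaks = new List<int>();
        if (gaps.Count == 0) return breaks;
        float typical = Median(gaps);
        for (int i = 0; i < gaps.Count; i++)
        {
            if (gaps[i] > typical * factor) breaks.Add(i);
        }
        return breaks;
    }
}
```

For example, baselines at 700, 686, 672, 650, 636 and 622 give gaps of 14, 14, 22, 14 and 14, so the median leading is 14 and the only flagged break is after the third line. Pages that mix font sizes would probably need the median computed per font size rather than per page.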
