How to detect newline from PDF using iTextSharp

Closed 7 years ago.

I have used the Vector.I2 coordinate of the baseline (from GetBaseline) to detect subscripts and superscripts. With this approach I am not able to detect line breaks in the PDF. Can you please suggest how to detect a new line in a PDF using iTextSharp?

The code you supplied isn't entirely self-explanatory, so I am making some assumptions, foremost that it is an excerpt of the RenderText(TextRenderInfo) method of a RenderListener implementation, probably an extension of SimpleTextExtractionStrategy with the added member variables lastBaseLine, firstcharacter_baseline, lastFontSize, and lastFont.

This implies that you only seem to be interested in documents in which text occurs in the content stream in reading order; otherwise you would have based your code on the LocationTextExtractionStrategy or a similar base algorithm.

Furthermore, I don't understand some of your if statements, which are either always false or always true, or whose body is empty. Nor is it clear what text_second is good for, or why you calculate difference = curBaseline[Vector.I2] - curBaseline[Vector.I2] in one place, which is always zero.

All this being said, your initial if statement seems to test whether or not the vertical base line position of the new text is different from that of the text before. Thus, this is where you could also spot the start of a new line.

I propose that you store not only the last base line but also the last descent line, which according to the docs is the line that represents the bottommost extent that a string of the current font could have, and compare it with the current ascent line (by the docs, the line that represents the topmost extent that a string of the current font could have).

If the ascent line of the current text is below the descent line of the last text, we should have a new line, because the text is too far down to be a subscript. In code, therefore:

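[...]
else if (curBaseline[Vector.I2] < lastBaseLine[Vector.I2])
{
    // The new text sits lower than the previous text.
    if (curAscentLine[Vector.I2] < lastDescentLine[Vector.I2])
    {
        // Its top is below the bottom of the previous text: treat it as a new line.
        firstcharacter_baseline = character_baseline;
        this.result.Append("<br/>");
    }
    else
    {
        // It still overlaps the previous text vertically: treat it as a subscript.
        difference = firstcharacter_baseline - curBaseline[Vector.I2];
        text_second.SetTextRise(difference);

        if (difference != 0)
        {
            SupSubFlag = 2;
        }
    }
}
[...]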
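
The decision in this branch depends only on a few vertical coordinates, so it can be isolated in a small helper that can be tried without a PDF. The following is only a sketch under assumptions, not your code and not part of iTextSharp: the names VerticalMove and LineBreakClassifier.Classify are hypothetical, the float arguments are assumed to be the Vector.I2 values of the respective lines, and the text is assumed to be unrotated, with larger y values higher on the page.

```csharp
// Hypothetical helper types, not part of iTextSharp.
public enum VerticalMove
{
    NotLower,   // current text is at or above the previous baseline
    Subscript,  // lower, but still overlapping the previous text vertically
    NewLine     // entirely below the previous text
}

public static class LineBreakClassifier
{
    // All arguments are y coordinates (Vector.I2); larger values are higher on the page.
    public static VerticalMove Classify(
        float lastBaselineY, float lastDescentY,
        float curBaselineY, float curAscentY)
    {
        if (curBaselineY >= lastBaselineY)
        {
            // Same height or higher: same line or superscript, decided elsewhere.
            return VerticalMove.NotLower;
        }

        // Lower than before: a new line only if the top of the current text
        // lies below the bottom of the previous text.
        return curAscentY < lastDescentY
            ? VerticalMove.NewLine
            : VerticalMove.Subscript;
    }
}
```

For this to work, the descent line of each chunk has to be remembered at the end of RenderText alongside the base line, so that it is available as the "last" value when the next chunk arrives.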

As you expect the text in the content stream to occur in reading order, you can also try to recognize a new line by comparing the Vector.I1 coordinates of the end of the base line of the last text and the start of the base line of the new text. If the new one is noticeably less than the old one, this looks like a carriage return, hinting at a new line.
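
A minimal sketch of that horizontal check, written as a standalone helper so it can be tried without a PDF. The class name, the method name, and the tolerance parameter are hypothetical. The two coordinates are assumed to be the Vector.I1 values of GetBaseline().GetEndPoint() for the previous text and GetBaseline().GetStartPoint() for the current text. What counts as a relevant amount depends on the document, so the tolerance is left to the caller.

```csharp
// Hypothetical helper, not part of iTextSharp.
public static class CarriageReturnCheck
{
    // lastEndX:  x coordinate (Vector.I1) where the previous text ended.
    // curStartX: x coordinate (Vector.I1) where the current text starts.
    // tolerance: how far back the start must jump before it counts;
    //            small backward steps are assumed to be kerning or similar.
    public static bool LooksLikeCarriageReturn(
        float lastEndX, float curStartX, float tolerance)
    {
        return curStartX < lastEndX - tolerance;
    }
}
```

Note that this check alone misses a line break after a very short line, where the next line starts at or to the right of the point where the previous one ended, so it is probably best combined with the vertical comparison rather than used instead of it.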

The code, of course, will run into trouble in a number of situations:

Whenever your expectation that the text in the content stream occurs in reading order is not fulfilled, you'll get garbage all over.
When you have multi-column text, the test above won't catch the line break between the bottom of one column and the top of the next. To catch this as well, you might want to check (analogously to the proposed check for a jump one line down) whether the new text is way above the last text, comparing the last ascent line with the new descent line.
If you get PDFs with very densely packed text, lines might overlap with the superscripts and subscripts of surrounding lines. In this case you will have to fine-tune the comparisons, but you will definitely still get falsely detected breaks sometimes.
If you get PDFs with rotated text, you'll get garbage all over.
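
The upward check suggested for multi-column text can be sketched as the mirror image of the downward one. This is only an illustration under assumptions: the class and method names are hypothetical, the arguments are assumed to be the Vector.I2 values of the last ascent line and the new descent line, and the text is assumed to be unrotated, with larger y values higher on the page.

```csharp
// Hypothetical helper, not part of iTextSharp.
public static class ColumnJumpCheck
{
    // lastAscentY: y coordinate (Vector.I2) of the top of the previous text.
    // curDescentY: y coordinate (Vector.I2) of the bottom of the current text.
    // Returns true when the current text lies entirely above the previous text,
    // which is too far up for a superscript and hints at a jump to the top of
    // the next column.
    public static bool IsEntirelyAbove(float lastAscentY, float curDescentY)
    {
        return curDescentY > lastAscentY;
    }
}
```

As with the downward check, densely packed text can still produce false positives here, so the comparison may need a margin tuned to the documents at hand.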