How to add/project iTextSharp vectors

Posted 2019-09-05 17:56

Question:

The Question

The PDF toolkit iText(Sharp) has its own Vector type, which implements Subtract, Dot product, Cross product and Multiply, but I see neither an addition of vectors nor a projection of one vector onto another.

Is there a simple way to do that?

The context

I am implementing an ITextExtractionStrategy that collects text into chunks (class MyTextChunk) if they

  • are on the same line: renderInfo.GetBaseline()
  • follow closely, to avoid concatenating texts from different columns in a table
  • and have the same height: renderInfo.GetAscentLine().GetStartPoint().Subtract(renderInfo.GetDescentLine().GetStartPoint()).Length

If I encounter a single smaller superscript character (i.e. above the baseline, but not above the ascent line), I assume it is something referred to later in the text, and I store it in the chunk as a char named referable.

If this referable is followed closely enough by more text, that text must be included in the chunk. Therefore, I need to extend the baseline to just past the referable character. So I thought of writing something like

    public bool Append(TextRenderInfo renderInfo)
    {
        ...
        if (thisIsAReferable) 
        {
            this.referable = infoText.Trim()[0];
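            // Project (end of the referable's baseline - start of the chunk's baseline)
            // onto baseVector: baseVector * (baseVector . v) / |baseVector|^2
            Vector offsetVector = baseVector.Multiply(
                baseVector.Dot(renderInfo.GetBaseline().GetEndPoint()
                .Subtract(this.baseline.GetStartPoint()))
                / baseVector.LengthSquared);
            // Vector has no Add method; this call is the addition being asked about.
            this.baseline = new LineSegment(this.baseline.GetStartPoint(),
                this.baseline.GetStartPoint().Add(offsetVector));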
            ...                
            return true;
        }
        ...
    }

Remark: The calculation of offsetVector is not yet verified.

Answer 1:

The Vector class is used to store the position of points on a PDF page when parsing a document. For instance: for each text snippet in the content stream of a PDF, we store several LineSegment objects: one that knows where the baseline starts and ends, one that knows where the ascent line starts and ends, one that knows where the descent line starts and ends. A LineSegment has two Vector elements, one for the start, one for the end.

The default coordinate system on a PDF page has an X axis that points to the right, and a Y axis that points upwards. The origin of the coordinate system depends on the value of the MediaBox (a mandatory property of each PDF page).

The default coordinate system can be changed using a transformation (*). The transformation is defined using a matrix that looks like this:

    | a  b  0 |
    | c  d  0 |
    | e  f  1 |

The operator to change the coordinate system (the cm operator) requires 6 operands: a, b, c, d, e and f. We don't need 9 operands in this matrix, because we are working in two dimensions.

If you want to define a translation, a and d should be 1, and b and c should be 0. You define the translation in the X direction by changing the value of e, and in the Y direction by changing the value of f.

You can scale the coordinate system by setting b, c, e and f to 0, changing a to define the scaling factor in the X direction, and changing d to define the scaling factor in the Y direction. And so on. All of this is explained in great detail in ISO 32000-1.
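To make the role of the six operands concrete, here is a minimal, library-free sketch. It assumes the row-vector convention of ISO 32000-1, where [x' y' 1] = [x y 1] multiplied by the rows (a b 0), (c d 0), (e f 1); the result gives the coordinates in the previous coordinate system of a point expressed in the new one. The class name CmMatrixDemo and the method Apply are made up for this illustration and are not part of iText(Sharp).

```csharp
using System;

public static class CmMatrixDemo
{
    // Hypothetical helper: applies the six cm operands to the point (x, y).
    // x' = a*x + c*y + e
    // y' = b*x + d*y + f
    public static (double X, double Y) Apply(
        double a, double b, double c, double d, double e, double f,
        double x, double y)
    {
        return (a * x + c * y + e, b * x + d * y + f);
    }

    public static void Main()
    {
        // Translation: a = d = 1, b = c = 0, offsets in e and f.
        var moved = Apply(1, 0, 0, 1, 10, 20, 1, 2);
        Console.WriteLine(moved);   // (11, 22)

        // Scaling: b = c = e = f = 0, factors in a and d.
        var scaled = Apply(2, 0, 0, 3, 0, 0, 1, 2);
        Console.WriteLine(scaled);  // (2, 6)
    }
}
```

Seen this way, adding an offset to a point is just the translation case with e and f set to the offset's components.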

I suggest that you perform translations (add) and other transformations using the cross() method (Cross() in iTextSharp) together with a Matrix for which you define the 6 elements.
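As a rough sketch of that suggestion, the helpers below assume the iTextSharp 5.x parser API (Vector, Matrix, the two-argument Matrix(tx, ty) constructor, the Vector indexer and the I1/I2/I3 constants); please check them against the version you use. The class VectorSketch and its method names are hypothetical and not part of the library. The translation variant only gives the expected result when the point's third component is 1, which as far as I know is the case for positions delivered by the parser, so a plain component-wise addition is shown as well. The projection reuses the formula from the question.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iTextSharp.
public static class VectorSketch
{
    // Addition expressed as a translation: the matrix (1 0 0 1 tx ty) is built
    // from the offset, and the point is crossed with it. This relies on the
    // point's third component being 1; it does not hold for differences
    // produced by Subtract, whose third component is 0.
    public static Vector Translate(Vector point, Vector offset)
    {
        Matrix translation = new Matrix(offset[Vector.I1], offset[Vector.I2]);
        return point.Cross(translation);
    }

    // Plain component-wise addition, with no assumption on the third component.
    public static Vector Add(Vector a, Vector b)
    {
        return new Vector(
            a[Vector.I1] + b[Vector.I1],
            a[Vector.I2] + b[Vector.I2],
            a[Vector.I3] + b[Vector.I3]);
    }

    // Orthogonal projection of v onto direction:
    // direction * (v . direction) / |direction|^2.
    // direction must not be the zero vector.
    public static Vector ProjectOnto(Vector v, Vector direction)
    {
        return direction.Multiply(v.Dot(direction) / direction.LengthSquared);
    }
}
```

With these, the last statement of the question's snippet could read VectorSketch.Translate(this.baseline.GetStartPoint(), offsetVector) in place of the missing Add call.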

We've never needed more methods for the Vector class because the parser always gives us the 6 operands of the cm operator, so we always have all the necessary elements to create a Matrix object.

(*) In PDF, we do not transform objects. We transform the coordinate system! I see that you live in Belgium. So do I, and I must admit that the concept of transforming the coordinate system seemed somewhat counter-intuitive because our schools and universities have taught us to transform objects, not coordinate systems.