The Question
The PDF toolkit iText(Sharp) has its own vector type, implementing Subtract, Dot product, Cross product and Multiply, but I do not see an addition of vectors nor a projection of one vector onto another.
Is there a simple way to do that?
The context
I am implementing an ITextExtractionStrategy that collects text into chunks (class MyTextChunk) if they
- are on the same line: renderInfo.GetBaseline()
- follow closely, to avoid concatenating texts from different columns in a table
- and have the same height: renderInfo.GetAscentLine().GetStartPoint().Subtract(renderInfo.GetDescentLine().GetStartPoint()).Length
If I encounter a single smaller superscript character (i.e. above the baseline, but not above the ascent line), I assume it is a reference mark that is referred to later in the text, and I store it as char referable in the chunk.
If this referable is followed closely enough by more text, that text must be included in the chunk. Therefore, I need to extend the baseline until after the referable character. So I thought of writing something like:
public bool Append(TextRenderInfo renderInfo)
{
    ...
    if (thisIsAReferable)
    {
        this.referable = infoText.Trim()[0];
        Vector offsetVector = baseVector.Multiply(
            baseVector.Dot(renderInfo.GetBaseline().GetEndPoint()
                .Subtract(this.baseline.GetStartPoint()))
            / baseVector.LengthSquared);
        this.baseline = new LineSegment(this.baseline.GetStartPoint(),
            this.baseline.GetStartPoint().Add(offsetVector));
        ...
        return true;
    }
    ...
}
Remark: The calculation of offsetVector is not yet verified.
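One way to check the offsetVector calculation is to write it out as the standard projection formula. The symbols are shorthand introduced here, not names from the code: s stands for this.baseline.GetStartPoint(), p for renderInfo.GetBaseline().GetEndPoint(), and b for baseVector, which is assumed to point along the existing baseline.

```latex
\vec{o} = \frac{\vec{b} \cdot (\vec{p} - \vec{s})}{\lVert \vec{b} \rVert^{2}} \, \vec{b},
\qquad
\vec{e}_{\text{new}} = \vec{s} + \vec{o}
```

Under that reading, the Multiply/Dot/LengthSquared expression matches the first formula term by term, and the second formula is the step that needs the missing addition.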
The Vector class is used to store the position of points on a PDF page when parsing a document. For instance, for each text snippet in the content stream of a PDF, we store several LineSegment objects: one that knows where the baseline starts and ends, one that knows where the ascent line starts and ends, and one that knows where the descent line starts and ends. A LineSegment has two Vector elements, one for the start, one for the end.
The default coordinate system on a PDF page has an X axis that points to the right, and a Y axis that points upwards. The origin of the coordinate system depends on the value of the MediaBox (a mandatory property of each PDF page).
The default coordinate system can be changed using a transformation (*). The transformation is defined using a matrix that looks like this:
a b 0
c d 0
e f 1
The operator to change the coordinate system (the cm operator) requires 6 operands: a, b, c, d, e and f. We don't need 9 operands in this matrix, because we are working in two dimensions: the third column is always 0, 0, 1.
If you want to define a translation, a and d should be 1; b and c should be 0. You define the translation in the X direction by changing the value of e; in the Y direction by changing the value of f.
You can scale the coordinate system by setting b, c, e and f to 0, then changing a to define the scaling factor in the X direction and d to define the scaling factor in the Y direction. And so on. All of this is explained in great detail in ISO-32000-1.
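To make the effect of the six operands concrete, here is how they map a point (x, y) to (x', y') under the row-vector convention of ISO-32000-1. The primed symbols are notation introduced for this sketch.

```latex
x' = a\,x + c\,y + e, \qquad y' = b\,x + d\,y + f
```

With a = d = 1 and b = c = 0 this reduces to x' = x + e, y' = y + f (a translation). With b = c = e = f = 0 it reduces to x' = a x, y' = d y (a scaling).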
I suggest that you perform translations (add) and other transformations using the cross() method and a matrix for which you define 6 elements.
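As a rough illustration of this suggestion, the sketch below expresses vector addition as a translation matrix applied through a Cross-style method, and builds the projection from the question on top of it. Vec3, Mat and VectorSketch are hypothetical stand-ins written for this example so that it runs without the library. They are not the iText(Sharp) types, and the assumption that the library's Vector and Matrix follow the same row-vector arithmetic should be checked against the version in use.

```csharp
using System;

// Hypothetical stand-in for illustration only; NOT the iText(Sharp) Vector.
public readonly struct Vec3
{
    public readonly float X, Y, Z;
    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }

    public Vec3 Subtract(Vec3 v) => new Vec3(X - v.X, Y - v.Y, Z - v.Z);
    public Vec3 Multiply(float by) => new Vec3(X * by, Y * by, Z * by);
    public float Dot(Vec3 v) => X * v.X + Y * v.Y + Z * v.Z;
    public float LengthSquared => Dot(this);

    // Row vector [X Y Z] times the matrix [a b 0; c d 0; e f 1].
    public Vec3 Cross(Mat m) => new Vec3(
        X * m.A + Y * m.C + Z * m.E,
        X * m.B + Y * m.D + Z * m.F,
        Z);
}

// Hypothetical stand-in holding the six cm operands a..f.
public readonly struct Mat
{
    public readonly float A, B, C, D, E, F;
    public Mat(float a, float b, float c, float d, float e, float f)
    { A = a; B = b; C = c; D = d; E = e; F = f; }

    // a = d = 1, b = c = 0, with e and f carrying the shift.
    public static Mat Translation(float tx, float ty) => new Mat(1, 0, 0, 1, tx, ty);
}

public static class VectorSketch
{
    // "Add" expressed as a translation; it only moves the point when point.Z == 1.
    public static Vec3 Add(Vec3 point, Vec3 offset) =>
        point.Cross(Mat.Translation(offset.X, offset.Y));

    // Projection of v onto direction: direction * (direction . v) / |direction|^2.
    public static Vec3 ProjectOnto(Vec3 v, Vec3 direction) =>
        direction.Multiply(direction.Dot(v) / direction.LengthSquared);
}
```

For example, with a baseline start of (10, 20, 1), a base direction of (4, 0, 0) and a referable end point of (13, 25, 1), ProjectOnto yields the offset (3, 0, 0) and Add yields the new end point (13, 20, 1). The translation relies on points carrying a third component of 1. A difference of two such points has a third component of 0 and is left unchanged by a translation matrix, which is why the offset is passed in through the matrix rather than being transformed itself.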
We've never needed more methods for the Vector class because the parser always gives us the 6 operands of the cm operator, so we always have all the necessary elements to create a Matrix object.
(*) In PDF, we do not transform objects. We transform the coordinate system! I see that you live in Belgium. So do I, and I must admit that the concept of transforming the coordinate system seemed somewhat counter-intuitive because our schools and universities have taught us to transform objects, not coordinate systems.