I'm working on extracting lines and rectangles from a PDF using iTextSharp. The method I use is shown below:
PdfReader reader = new PdfReader(strPDFFileName);
var pageSize = reader.GetPageSize(1);
var cropBox = reader.GetCropBox(1);
byte[] pageBytes = reader.GetPageContent(1);
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(pageBytes));
PRTokeniser.TokType tokenType;
string tokenValue;
// Assumed declaration (missing from the snippet as posted): the token values seen so far
List<string> buf = new List<string>();
CoordinateCollection cc = new CoordinateCollection();
while (tokeniser.NextToken())
{
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    // Assumed: every token is recorded, so "re" is the last entry and its operands precede it
    buf.Add(tokenValue);
    if (tokenType == PRTokeniser.TokType.OTHER)
    {
        if (tokenValue == "re")
        {
            if (buf.Count < 5)
            {
                continue;
            }
            // Operand order in the content stream: x y w h re
            float x = float.Parse(buf[buf.Count - 5]);
            float y = float.Parse(buf[buf.Count - 4]);
            float w = float.Parse(buf[buf.Count - 3]);
            float h = float.Parse(buf[buf.Count - 2]);
            Coordinate co = new Coordinate();
            co.type = "re";
            co.X1 = x;
            co.Y1 = y;
            co.W = w;
            co.H = h;
            cc.AddCoordinate(co);
        }
    }
}
The code works fine, but I have run into an issue with the PDF measurement unit. reader.GetPageSize returns 619 x 792, but the rectangles I get from the tokeniser always lie far outside the page; typical values are x=150, y=4200, w=1500, h=2000.
I believe reader.GetPageSize and the tokeniser use different measurement units.
Could you please tell me how to convert between them?
As a starting remark: what you actually extract are the coordinate parameters of the re operation in the PDF content stream; their values are not iTextSharp-specific.
The values you get
To understand why the coordinates of the rectangle seem so far off the page, you first have to realize that the coordinate system used in PDFs is mutable!
The user space coordinate system is merely initialized to a default state in which the CropBox entry in the page dictionary specifies the rectangle of user space corresponding to the visible area.
In the course of the page content operations, the coordinate system may be transformed, even multiple times, using the cm operation. Common transformations are rotations, translations, skews, and scalings.
In your case most likely at least a scaling is in place.
You might want to study details in section 8.3 "Coordinate Systems" of the PDF specification.
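To make the effect of a cm operation concrete, here is a small worked example. The general mapping follows the row-vector convention of the PDF specification for an operation "a b c d e f cm". The scale factor 0.12 is purely an assumption for illustration; the actual matrix has to be read from your content stream.

```latex
% A point (x, y) in the current user space is mapped by the matrix [a b c d e f]:
\[
\begin{pmatrix} x' & y' & 1 \end{pmatrix}
=
\begin{pmatrix} x & y & 1 \end{pmatrix}
\begin{pmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{pmatrix},
\qquad
x' = a x + c y + e, \quad y' = b x + d y + f .
\]
% Assumed example: a pure scaling "0.12 0 0 0.12 0 0 cm" applied to the
% rectangle x = 150, y = 4200, w = 1500, h = 2000 from the question.
\[
(150,\ 4200) \mapsto (0.12 \cdot 150,\ 0.12 \cdot 4200) = (18,\ 504),
\qquad
(1650,\ 6200) \mapsto (198,\ 744).
\]
```

Under that assumed scaling, the rectangle spans (18, 504) to (198, 744) in default user space, which would fit on a page of the size reported in the question.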
How to extract positions including the transformation
To retrieve coordinates including transformations, you have to find cm operations in addition to the re operations. Furthermore, you have to find q and Q operations (save and restore graphics state, including the current transformation matrix).
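Done by hand, that bookkeeping could look roughly like the following sketch. It does not use iTextSharp at all: Affine and CtmSketch are hypothetical names, the token list is made up, and the 0.12 scaling is an assumption chosen only to resemble the numbers in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not an iTextSharp type): the six numbers of a PDF
// transformation matrix [a b c d e f], using the row-vector convention.
struct Affine
{
    public readonly double A, B, C, D, E, F;

    public Affine(double a, double b, double c, double d, double e, double f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    public static readonly Affine Identity = new Affine(1, 0, 0, 1, 0, 0);

    // Matrix product "this x other".
    public Affine Multiply(Affine o)
    {
        return new Affine(
            A * o.A + B * o.C, A * o.B + B * o.D,
            C * o.A + D * o.C, C * o.B + D * o.D,
            E * o.A + F * o.C + o.E, E * o.B + F * o.D + o.F);
    }

    // Maps a point from the current user space to the default user space.
    public double[] Apply(double x, double y)
    {
        return new[] { A * x + C * y + E, B * x + D * y + F };
    }
}

static class CtmSketch
{
    static void Main()
    {
        // Made-up content stream tokens; the 0.12 scaling is an assumption.
        string[] tokens = { "q", "0.12", "0", "0", "0.12", "0", "0", "cm",
                            "150", "4200", "1500", "2000", "re", "S", "Q" };

        Affine ctm = Affine.Identity;       // default user space
        var saved = new Stack<Affine>();    // graphics state stack for q / Q
        var operands = new List<double>();  // operands precede their operator

        foreach (string token in tokens)
        {
            double number;
            if (double.TryParse(token, NumberStyles.Float,
                                CultureInfo.InvariantCulture, out number))
            {
                operands.Add(number);
                continue;
            }

            int n = operands.Count;
            if (token == "q")
            {
                saved.Push(ctm);
            }
            else if (token == "Q" && saved.Count > 0)
            {
                ctm = saved.Pop();
            }
            else if (token == "cm" && n >= 6)
            {
                // The new matrix is premultiplied onto the current CTM.
                Affine m = new Affine(operands[n - 6], operands[n - 5], operands[n - 4],
                                      operands[n - 3], operands[n - 2], operands[n - 1]);
                ctm = m.Multiply(ctm);
            }
            else if (token == "re" && n >= 4)
            {
                double x = operands[n - 4], y = operands[n - 3];
                double w = operands[n - 2], h = operands[n - 1];
                double[] p1 = ctm.Apply(x, y);
                double[] p2 = ctm.Apply(x + w, y + h);
                Console.WriteLine("re corners in default user space: ({0}, {1}) ({2}, {3})",
                                  p1[0], p1[1], p2[0], p2[1]);
            }
            operands.Clear();  // every operator consumes its operands
        }
    }
}
```

This is deliberately simplified: a real content stream also contains strings, names, arrays, inline images, and form XObjects with their own matrices, none of which this sketch handles.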
Fortunately, iTextSharp's parser namespace classes can do most of the heavy lifting for you; since version 5.5.6 they also support vector graphics. You merely have to implement IExtRenderListener and parse the content using an instance.
E.g. to output vector graphics information on the console, you can use an implementation like this:
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

class VectorGraphicsListener : IExtRenderListener
{
    // Called once for every path construction operation (m, l, c, v, y, h, re)
    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        if (renderInfo.Operation == PathConstructionRenderInfo.RECT)
        {
            float x = renderInfo.SegmentData[0];
            float y = renderInfo.SegmentData[1];
            float w = renderInfo.SegmentData[2];
            float h = renderInfo.SegmentData[3];
            // Transform the four corners using the current transformation matrix
            Vector a = new Vector(x, y, 1).Cross(renderInfo.Ctm);
            Vector b = new Vector(x + w, y, 1).Cross(renderInfo.Ctm);
            Vector c = new Vector(x + w, y + h, 1).Cross(renderInfo.Ctm);
            Vector d = new Vector(x, y + h, 1).Cross(renderInfo.Ctm);
            Console.Out.WriteLine("Rectangle at ({0}, {1}) with size ({2}, {3})", x, y, w, h);
            Console.Out.WriteLine("--> at ({0}, {1}) ({2}, {3}) ({4}, {5}) ({6}, {7})", a[Vector.I1], a[Vector.I2], b[Vector.I1], b[Vector.I2], c[Vector.I1], c[Vector.I2], d[Vector.I1], d[Vector.I2]);
        }
        else
        {
            switch (renderInfo.Operation)
            {
                case PathConstructionRenderInfo.MOVETO:
                    Console.Out.Write("Move to");
                    break;
                case PathConstructionRenderInfo.LINETO:
                    Console.Out.Write("Line to");
                    break;
                case PathConstructionRenderInfo.CLOSE:
                    Console.Out.WriteLine("Close");
                    return;
                default:
                    Console.Out.Write("Curve along");
                    break;
            }
            // The segment data is a flat list of x/y pairs
            List<Vector> points = new List<Vector>();
            for (int i = 0; i < renderInfo.SegmentData.Count - 1; i += 2)
            {
                float x = renderInfo.SegmentData[i];
                float y = renderInfo.SegmentData[i + 1];
                Console.Out.Write(" ({0}, {1})", x, y);
                Vector a = new Vector(x, y, 1).Cross(renderInfo.Ctm);
                points.Add(a);
            }
            Console.Out.WriteLine();
            Console.Out.Write("--> at ");
            foreach (Vector point in points)
            {
                Console.Out.Write(" ({0}, {1})", point[Vector.I1], point[Vector.I2]);
            }
            Console.Out.WriteLine();
        }
    }

    public void ClipPath(int rule)
    {
        Console.Out.WriteLine("Clip");
    }

    // Called for every path painting operation
    public iTextSharp.text.pdf.parser.Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        switch (renderInfo.Operation)
        {
            case PathPaintingRenderInfo.FILL:
                Console.Out.WriteLine("Fill");
                break;
            case PathPaintingRenderInfo.STROKE:
                Console.Out.WriteLine("Stroke");
                break;
            case PathPaintingRenderInfo.STROKE + PathPaintingRenderInfo.FILL:
                Console.Out.WriteLine("Stroke and fill");
                break;
            case PathPaintingRenderInfo.NO_OP:
                Console.Out.WriteLine("Drop");
                break;
        }
        return null;
    }

    // Text and image events are not of interest here
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
    public void RenderText(TextRenderInfo renderInfo) { }
}
and apply it to a PDF like this:
using (var pdfReader = new PdfReader(....))
{
    // Loop through each page of the document
    for (var page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        VectorGraphicsListener listener = new VectorGraphicsListener();
        PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
        parser.ProcessContent(page, listener);
    }
}
After "Rectangle at", "Move to", "Line to", and "Curve along" you'll see the coordinate information without the transformation applied, i.e. as retrieved by your code.
After "-->" you'll see the corresponding transformed coordinates.
PS: This feature is still new. Most likely there will soon be an alternative, easier approach in which iTextSharp bundles the path information for you instead of simply forwarding each path-building operation one at a time.