How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of text.
In my case I just wanted the text from a specific area of the PDF document, so I defined a rectangle around that area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools, so when it came time to narrow the rectangle down to the specific location, I guessed at the coordinates until I found the area.
As noted in the comments above, the resulting text doesn't keep any of the formatting found in the PDF document, but I was happy that it did preserve the carriage returns. In my case there were enough constants in the text that I was able to extract the values I needed.
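A minimal sketch of the region-based extraction described above, assuming iTextSharp 5.x (the AGPL line) and its `iTextSharp.text.pdf.parser` namespace. The class name `PdfRegionText`, the method `ExtractRegion`, and its parameters are hypothetical names chosen for illustration, not taken from the answer.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfRegionText
{
    // Hypothetical helper: collects the text found inside one rectangle on every page.
    // x and y give the lower-left corner of the rectangle in PDF points (1/72 inch);
    // the page origin is normally the bottom-left corner.
    public static string ExtractRegion(string path, float x, float y, float width, float height)
    {
        var region = new System.util.RectangleJ(x, y, width, height);
        var filters = new RenderFilter[] { new RegionTextRenderFilter(region) };
        var text = new StringBuilder();

        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filters);
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }

        return text.ToString();
    }
}
```

For example, `PdfRegionText.ExtractRegion("input.pdf", 0, 0, 612, 792)` would cover a whole US Letter page (612 x 792 points), where `input.pdf` is a placeholder path. Shrinking the rectangle step by step is one way to home in on the target area when no PDF authoring tool is available.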
LGPL / FOSS iTextSharp 4.x
None of the other answers were useful to me; they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.
Something else that might be very useful in conjunction with this:
This will extract the text-only data from the PDF. If the text displayed is Foo(bar), it will be encoded in the PDF as (Foo\(bar\))Tj, and this method would return Foo(bar) as expected. This method will strip out lots of additional information, such as location coordinates, from the raw PDF content.
Here is a VB.NET solution based on ShravankumarKumar's solution.
This will ONLY give you the text. The images are a different story.
You can't read and parse the contents of a PDF using iTextSharp like you'd like to.
From iTextSharp's SourceForge tutorial: