I am getting huge PDF files with lots of data. The current PDF is 350 MB and has about 40000 pages. It would of course have been nice to get smaller PDFs, but this is what I have to work with now :-(
I can open it in Acrobat Reader with some delay while loading, but after that Acrobat Reader is quick.
Now I need to split the huge file into single pages, read some recipient data from each PDF page, and then send each recipient the one or two pages meant for them.
Here is my very small code so far, using iTextSharp:
var inFileName = @"huge350MB40000pages.pdf";
PdfReader reader = new PdfReader(inFileName);
var nbrPages = reader.NumberOfPages;
reader.Close();
Execution reaches the second line, "new PdfReader", and stays there for perhaps 10 minutes while the process grows to about 1.7 GB, and then I get an OutOfMemoryException.
I think "new PdfReader" attempts to read the entire PDF into memory.
Is there some other/better way to do this?
For example, can I somehow read only a part of a PDF file into memory instead of all of it at once?
Could it work better using some other library than iTextSharp?
From what I have read, it looks like you should instantiate the PdfReader with the constructor that takes a RandomAccessFileOrArray object. Disclaimer: I have not tried this out myself.
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(new iTextSharp.text.pdf.RandomAccessFileOrArray(@"C:\PDFFile.pdf"), null);
This is a total shot in the dark, and I haven't tested this code. It's an extract from the 'iText in Action' book, given there as an example of how to deal with large PDF files. The code is in Java but should be fairly easy to convert.
This is the method that loads everything into memory:
PdfReader reader;
long before;
before = getMemoryUse();
reader = new PdfReader(
"HelloWorldToRead.pdf", null);
System.out.println("Memory used by the full read: "
+ (getMemoryUse() - before));
This is the memory-saving way, where the document should be loaded bit by bit as required:
before = getMemoryUse();
reader = new PdfReader(
new RandomAccessFileOrArray("HelloWorldToRead.pdf"), null);
System.out.println("Memory used by the partial read: "
+ (getMemoryUse() - before));
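To illustrate the general idea behind a partial read (this is not iText's implementation, just a minimal sketch using only java.io.RandomAccessFile), the code below seeks to the end of a file and reads only the last few bytes to find the "startxref" keyword that a PDF stores near its end. The class name PdfTailSketch, the tailSize parameter, and the writeDemoFile helper are hypothetical names invented for this example; the demo file is dummy padding plus a trailer, not a real PDF.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfTailSketch {
    private static final String MARKER = "startxref";

    /**
     * Reads at most tailSize bytes from the end of the file and returns the
     * number that follows the last "startxref" keyword, or -1 if none is found.
     * Only the tail is ever held in memory, regardless of the file size.
     */
    static long findStartXref(String path, int tailSize) {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int toRead = (int) Math.min(tailSize, length);
            byte[] tail = new byte[toRead];
            file.seek(length - toRead); // jump straight to the tail
            file.readFully(tail);

            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int at = text.lastIndexOf(MARKER);
            if (at < 0) {
                return -1;
            }
            // Skip whitespace after the keyword, then collect the digits.
            int i = at + MARKER.length();
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && Character.isDigit(text.charAt(i))) {
                i++;
            }
            return i > start ? Long.parseLong(text.substring(start, i)) : -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical demo helper: writes zero padding followed by a PDF-style trailer. */
    static String writeDemoFile(int paddingBytes, long xrefOffset) {
        try {
            Path tmp = Files.createTempFile("tail-sketch", ".pdf");
            byte[] trailer = ("\nstartxref\n" + xrefOffset + "\n%%EOF\n")
                    .getBytes(StandardCharsets.ISO_8859_1);
            byte[] all = new byte[paddingBytes + trailer.length]; // padding stays zero
            System.arraycopy(trailer, 0, all, paddingBytes, trailer.length);
            Files.write(tmp, all);
            tmp.toFile().deleteOnExit();
            return tmp.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String path = writeDemoFile(1_000_000, 987654L);
        System.out.println(findStartXref(path, 1024));
    }
}
```

A reader built this way only needs the cross-reference information and the objects it is actually asked for, which is presumably why the RandomAccessFileOrArray constructor uses less memory; whether that is enough for a 350 MB file with 40000 pages would still need to be measured.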
You might be able to use Ghostscript directly. http://svn.ghostscript.com/ghostscript/tags/ghostscript-9.02/doc/Use.htm#One_page_per_file
For reading the recipient data, PDFTextStream might be a good choice.
PDF Toolkit is quite useful for these types of tasks, though I haven't tried it with such a huge file yet.
"Could it work better using some other library than iTextSharp?"
Please try Aspose.Pdf for .NET, which lets you split the PDF into single pages or into different sets of pages in various ways, using either files or memory streams. The API is simple to learn and use, and it works with large PDF files that have a large number of pages.
Disclosure: I work as a developer evangelist at Aspose.