I am trying to split a document of roughly 300 pages using the Apache PDFBox API v2.0.2. While trying to split the PDF file into single pages using the following code:
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import java.util.List;
PDDocument document = PDDocument.load(inputFile);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document); // Exception happens here
I receive the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
This indicates that the GC is spending an excessive amount of time collecting while reclaiming very little memory.
There are numerous JVM tuning options that can work around the situation; however, all of these merely treat the symptom and not the real issue.
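(For reference, a typical workaround of that kind would be raising the heap ceiling or disabling the overhead check, e.g. with a hypothetical jar name:

java -Xmx2048m -XX:-UseGCOverheadLimit -jar my-splitter.jar

but, as said, this only postpones the problem for larger inputs.)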
One final note: I am using JDK 6, hence the new Java 8 Consumer is not an option in my case. Thanks
Edit:
This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:
1. I do not have the size problem mentioned in the aforementioned topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing each slice averages 80 KB, with a total size of 30.7 MB.
2. The split throws the exception before it even returns the split parts.
I found that the split can pass as long as I do not pass the whole document at once; instead, I pass it in "batches" of 20-30 pages each, which does the job.
PDFBox stores the parts resulting from the split operation in the heap as PDDocument objects, so the heap fills up quickly; even if you call close() on each part after every round of the loop, the GC cannot reclaim heap space at the rate it is being consumed.
One option is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages), as sketched below.
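For illustration, here is a minimal sketch of the batch approach, assuming a placeholder input path and output naming scheme; Splitter.setStartPage() and setEndPage() restrict each round of the split to the current batch (the code avoids Java 7/8 features to stay JDK 6 compatible):

import java.io.File;
import java.util.List;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class BatchSplitter {
    private static final int BATCH_SIZE = 20; // pages per batch; 10-40 is a reasonable range

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("input.pdf")); // placeholder path
        try {
            int totalPages = document.getNumberOfPages();
            int pageCounter = 0;
            for (int start = 1; start <= totalPages; start += BATCH_SIZE) {
                int end = Math.min(start + BATCH_SIZE - 1, totalPages);
                Splitter splitter = new Splitter();
                splitter.setStartPage(start); // limit this round to the current batch
                splitter.setEndPage(end);
                List<PDDocument> pages = splitter.split(document);
                // Save and close every slice before moving on, so the heap
                // never holds more than BATCH_SIZE single-page documents.
                for (PDDocument page : pages) {
                    pageCounter++;
                    page.save("page-" + pageCounter + ".pdf"); // hypothetical naming scheme
                    page.close(); // release this slice before processing the next one
                }
            }
        } finally {
            document.close();
        }
    }
}

Closing each single-page PDDocument as soon as it is saved is what keeps the footprint bounded; only the source document remains open across batches.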