Split and merge pdf files using PDFBOX produces la

Posted 2019-07-30 19:07

Question:

I have a large PDF print file that contains 5544 pages and is about 36 MB in size. The file was created by MS Word 2010 and contains only text and a logo on each letter/document.

I split it into 5544 single-page files and merge those back into 2770 letters, based on keywords. Each letter is approx. 140-145 KB.

When I merge all the letters into a new PDF print file, still containing 5544 pages, the file size grows to 396 MB.

All text extraction, splitting and merging is performed with calls to the Apache PDFBox command-line tools from PHP, but the result is the same when they are run from a console.

Any idea how to reduce the file size of the letters and the final print file? It seems like PDFBox has just appended each letter to the final print file, instead of creating a new PDF document.

Only in the testing phase are all the documents merged into the final print file; in production, some of the documents will be sent by email.

I have also tried SAMBox (a fork of PDFBox) but with nearly the same result:

pdfinfo Original.pdf
Title:          Printfile
Author:         Claus Hjort Bube
Creator:        Microsoft® Word 2010
Producer:       Microsoft® Word 2010
CreationDate:   Fri May 19 12:16:34 2017 CEST
ModDate:        Fri May 19 12:16:34 2017 CEST
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      36092281 bytes
Optimized:      no
PDF version:    1.5

pdfinfo PDFBox.pdf
Title:          Printfile
Author:         Claus Hjort Bube
Creator:        Microsoft® Word 2010
Producer:       Microsoft® Word 2010
CreationDate:   Fri May 19 12:16:34 2017 CEST
ModDate:        Fri May 19 12:16:34 2017 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      396622354 bytes
Optimized:      no
PDF version:    1.4

pdfinfo SAMBox.pdf
Creator:        Sejda Console 3.2.17
Producer:       SAMBox 1.1.8 (www.sejda.org)
ModDate:        Tue Jul 11 23:34:33 2017 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          5544
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
Page rot:       0
File size:      378779436 bytes
Optimized:      no
PDF version:    1.7

Answer 1:

That may sound sad, but this is the correct behavior. When splitting, each file gets its own copy of the resources it needs (e.g. fonts and the company logo graphic). When the files are merged back, PDFBox does not know that these resources may be identical across the whole document, so they end up duplicated many times.
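
As a rough sanity check of this explanation, the sizes reported in the question can be compared directly. The sketch below uses only the figures from the pdfinfo output and the letter count given in the question; the class and method names are made up for illustration.

```java
public class DuplicationEstimate {
    // Figures copied from the question (pdfinfo output and description).
    static final long ORIGINAL_BYTES = 36_092_281L;  // Original.pdf
    static final long MERGED_BYTES = 396_622_354L;   // PDFBox.pdf
    static final int PAGES = 5544;
    static final int LETTERS = 2770;

    /** Average size of one letter inside the merged print file. */
    static long bytesPerLetter() {
        return MERGED_BYTES / LETTERS;
    }

    /** Average size of one page inside the original print file. */
    static long bytesPerOriginalPage() {
        return ORIGINAL_BYTES / PAGES;
    }

    /** How many times larger the merged file is than the original. */
    static double growthFactor() {
        return (double) MERGED_BYTES / ORIGINAL_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("Bytes per letter (merged):  " + bytesPerLetter());
        System.out.println("Bytes per page (original):  " + bytesPerOriginalPage());
        System.out.println("Growth factor:              " + growthFactor());
    }
}
```

The merged file averages about 143 KB per letter, which matches the 140-145 KB reported for each individual letter, while the original needed only about 6.5 KB per page. That is consistent with the merged file being little more than a concatenation of letters that each carry their own fonts and logo, although how much of each letter is duplicated resources is an estimate here, not something measured.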

The only solution I see for you would be to use the PDFBox Java API to create the mailing files and the final print file in one step, i.e. without creating single-page files that are then merged back.
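
A minimal, dependency-free sketch of the planning half of such a one-step approach follows. It assumes the text of every page has already been extracted from the single source document, and that a letter starts on each page containing some keyword; `LetterPlanner`, `planLetters` and the keyword "Dear" are hypothetical names chosen for illustration, not part of PDFBox. The actual page copying is deliberately left out. With PDFBox 2.x it would presumably go through a call such as `PDDocument.importPage` on pages of the one open source document, but that part is an assumption and should be checked against the PDFBox documentation.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterPlanner {

    /**
     * Groups consecutive pages into letters. Each result is a half-open,
     * 0-based page range {start, end}. A new letter begins on every page
     * (other than the first) whose text contains startKeyword; pages before
     * the first match are kept in the first range.
     */
    static List<int[]> planLetters(List<String> pageTexts, String startKeyword) {
        List<int[]> ranges = new ArrayList<>();
        if (pageTexts.isEmpty()) {
            return ranges;
        }
        int start = 0;
        for (int i = 1; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).contains(startKeyword)) {
                ranges.add(new int[] {start, i}); // close the previous letter
                start = i;                        // this page opens a new one
            }
        }
        ranges.add(new int[] {start, pageTexts.size()}); // last letter
        return ranges;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        pages.add("Dear A, ...");
        pages.add("continued ...");
        pages.add("Dear B, ...");
        for (int[] r : planLetters(pages, "Dear")) {
            System.out.println("letter covers pages " + r[0] + " to " + (r[1] - 1));
        }
    }
}
```

With the ranges known up front, each letter and the print file can be written from the same open source document, so no intermediate single-page files are created and merged back.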



Tags: pdf pdfbox