Why does combining PDFs make the file size balloon?

Posted 2019-03-06 18:40

Question:

I'm attempting to stitch together various PDFs. They're not that text-heavy, with the occasional image. Say, for example, I have two PDFs, 1.4 MB and 740 KB: when I combine them, they balloon to 6 MB!

I've tried scripted combination and hand appending, with the same result, so I'm guessing it's an underlying issue. Some explanation of why it happens would be useful, so I can look at ways of avoiding it. Is it a mismatch in colour models? The fonts are minimal.

Answer 1:

You aren't telling us how you're combining the PDFs, which makes your question rather theoretical, so I am going to give you a theoretical answer:

Part 1

  • Suppose you have a PDF file with 10 pages and a total size of 1200 KByte.
  • Suppose that the content stream of each page roughly consists of 100 KByte. From this content stream, there are references to shared resources.
  • Suppose that these 10 pages share 200 KByte in resources: they share the same fonts, the same images, and so on.

If you "burst" this PDF into 10 separate single-page PDFs, each PDF will consist of about 300 KByte: 100 KByte in content stream + 200 KByte in resources (I'm ignoring the overhead of having 10 separate xref tables and file trailers).

  • If you combine these 10 separate single-page PDFs as if these 10 PDFs have nothing in common, the total file size will be 10 x 300 KByte. That's 3000 KByte, which is more than double the original 1200 KByte.
  • If you combine these 10 separate single-page PDFs taking into account that they have resources in common (fonts, images, ...), the total size will be (10 x 100 KByte) + 200 KByte = 1200 KByte, the same as the original.
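The arithmetic of this example can be written out as a short sketch. The numbers are the ones assumed above; the variable names are only illustrative.

```python
# Sizes in KByte, taken from the theoretical example above.
PAGES = 10
CONTENT_PER_PAGE = 100   # content stream of one page
SHARED_RESOURCES = 200   # fonts, images, ... shared by all pages

# The original 10-page document stores the shared resources once.
original = PAGES * CONTENT_PER_PAGE + SHARED_RESOURCES

# After bursting, every single-page file carries its own copy of the resources.
single_page = CONTENT_PER_PAGE + SHARED_RESOURCES

# Merging without looking for common resources keeps all 10 copies.
naive_merge = PAGES * single_page

# Merging while reusing identical resources keeps just one copy.
smart_merge = PAGES * CONTENT_PER_PAGE + SHARED_RESOURCES

print(original, single_page, naive_merge, smart_merge)  # 1200 300 3000 1200
```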

If you're using iText to combine the PDFs, then using PdfCopy will result in the 3000 KByte PDF, because PdfCopy just copies documents as fast as possible without looking at the content of the document. If you want the 1200 KByte PDF, then you need to use PdfSmartCopy. That costs more memory and CPU, because iText will examine each PDF and reuse objects that would otherwise be redundant.
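The difference between the two strategies can be illustrated with a toy model. This is not iText code and not a description of how PdfSmartCopy is implemented internally; it is a minimal sketch that assumes each document is just a list of byte strings and that "redundant" means byte-identical. The function names and the stand-in data are hypothetical.

```python
import hashlib

def merge_naive(documents):
    """Copy every object of every document; nothing is shared."""
    return [obj for doc in documents for obj in doc]

def merge_dedup(documents):
    """Keep only the first copy of each byte-identical object."""
    seen = set()
    merged = []
    for doc in documents:
        for obj in doc:
            digest = hashlib.sha256(obj).digest()
            if digest not in seen:
                seen.add(digest)
                merged.append(obj)
    return merged

def total_size(objects):
    return sum(len(obj) for obj in objects)

# Hypothetical stand-ins: one shared 200,000-byte resource and
# ten distinct 100,000-byte page content streams.
shared_font = b"F" * 200_000
pages = [bytes([i]) * 100_000 for i in range(10)]
documents = [[page, shared_font] for page in pages]

naive_size = total_size(merge_naive(documents))
dedup_size = total_size(merge_dedup(documents))
print(naive_size, dedup_size)  # 3000000 1200000
```

The extra memory and CPU mentioned above show up here as the hash computed for every object and the set of digests kept for the whole merge. A real PDF merger also has to renumber objects and rewrite the references between them, which this sketch leaves out.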

Part 2

In your question, you mention that you have a 1.4 MB and a 740 KB PDF, and that combining them results in a PDF of 6 MB. The first part of my theoretical example doesn't explain such extreme growth in size, so here's a second part.

  • In PDF 1.0, PDF syntax wasn't compressed.
  • Starting with PDF 1.2, streams were compressed, but indirect objects and the cross-reference table were stored in ASCII.
  • Starting with PDF 1.5, a series of objects could be compressed in an object stream and the cross-reference table could be compressed too.
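To get a feel for why compressing the objects themselves matters, here is a rough sketch using zlib, the same deflate format that PDF's Flate compression relies on. The page dictionaries below are made-up, simplified examples, and a real object stream is laid out differently (it stores an offset table and omits the obj/endobj wrappers), so this only shows how well repetitive PDF syntax compresses.

```python
import zlib

# Hypothetical, simplified page dictionaries as they would appear
# in a file that stores indirect objects as plain text.
objects = [
    (
        f"{n} 0 obj\n<< /Type /Page /Parent 1 0 R /Resources 2 0 R "
        f"/MediaBox [0 0 595 842] /Contents {n + 1000} 0 R >>\nendobj\n"
    ).encode("ascii")
    for n in range(10, 510)
]

plain = b"".join(objects)       # what an uncompressed file would store
packed = zlib.compress(plain)   # roughly what compressing them together saves

print(len(plain), len(packed))

# Compression is lossless: the original bytes come back unchanged.
assert zlib.decompress(packed) == plain
```

Dictionaries like these are highly repetitive, so the compressed form is much smaller than the plain text. Going the other way, from compressed to plain, is where the growth comes from.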

Suppose that your original PDFs have compressed object streams and a compressed cross-reference stream. Suppose that you combine these PDFs into a PDF that is more like a PDF 1.4 document, with every indirect object and the cross-reference table written out as plain text. In that case, the objects that used to live in compressed object streams and the cross-reference data will no longer be compressed, resulting in a much bigger file size.
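One rough way to check whether this applies is to look at the raw bytes of the input files and of the merged file. The sketch below is a heuristic, not a PDF parser: it assumes that the dictionaries of object streams and cross-reference streams are readable in the raw file as `/Type /ObjStm` and `/Type /XRef`, and it only reads the version from the file header (a document can also declare its version elsewhere). The function name `describe_pdf` is hypothetical.

```python
import re

def describe_pdf(data: bytes) -> dict:
    """Report a few raw-byte hints about how a PDF stores its objects."""
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return {
        # Version from the file header, e.g. "1.4" or "1.5".
        "header_version": header.group(1).decode("ascii") if header else None,
        # Number of compressed object streams (PDF 1.5 and later).
        "object_streams": len(re.findall(rb"/Type\s*/ObjStm\b", data)),
        # Number of cross-reference streams (PDF 1.5 and later).
        "xref_streams": len(re.findall(rb"/Type\s*/XRef\b", data)),
        # A classic plain-text table starts with the keyword "xref"
        # (the lookbehind avoids matching the "startxref" keyword).
        "classic_xref_table": re.search(rb"(?<![A-Za-z])xref\s", data) is not None,
    }

# Tiny hand-written stand-in for a file that uses PDF 1.5 features.
sample = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<< /Type /ObjStm /N 3 /First 16 /Length 0 >>\n"
    b"stream\n\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /XRef /Length 0 >>\nstream\n\nendstream\nendobj\n"
    b"startxref\n0\n%%EOF\n"
)
report = describe_pdf(sample)
print(report)
```

For real files, you would pass in the bytes read from disk, for example `describe_pdf(open("merged.pdf", "rb").read())` with your own file name. If the inputs report object streams and the merged output reports none, the loss of compression described above is a likely contributor.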

Part 3?

There might be other reasons, depending on the nature of the original PDFs and on the tool that you're using to combine the PDFs. You should clarify if none of the above applies.