Adding metadata to PDF

Published 2019-06-16 17:56

Question:

I need to add metadata to a PDF that I am creating with prawn. The metadata will be extracted later, probably with pdf-reader, and will contain internal document numbers and other information needed by downstream tools.

It would be convenient to associate metadata with each page of the PDF. The PDF specification says that I can store per-page private data in a "Page-Piece Dictionary". Section 14.5 states:

A page-piece dictionary (PDF 1.3) may be used to hold private conforming product data. The data may be associated with a page or form XObject by means of the optional PieceInfo entry in the page object (see Table 30) or form dictionary (see Table 95). Beginning with PDF 1.4, private data may also be associated with the PDF document by means of the PieceInfo entry in the document catalogue (see Table 28).

How can I set a "page-piece dictionary" with prawn? I'm using prawn 0.12.0.

If that's not possible, how else can I achieve my goal of storing metadata about each page, either at the page level, or at the document level?

Answer 1:

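You can look at the source of prawn:

https://github.com/prawnpdf/prawn/commit/131082af5abb71d83de0e2005ecceaa829224904

Prawn::Document.new accepts an :info option, which sets document-level metadata in the PDF's document information dictionary:

require 'prawn'

# Entries for the document information dictionary.
info = { :Title        => "Sample METADATA",
         :Author       => "Me",
         :Subject      => "Not Working",
         :CreationDate => Time.now }

# filename is the path to an existing PDF used as the template.
@pdf = Prawn::Document.new(:template => filename, :info => info)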


Answer 2:

One way is to do none of the above; that is, don't attach the metadata as a page-piece dictionary, and don't attach it with prawn. Instead, attach the metadata as a file attachment using the pdftk command-line tool.

To do it this way, create a file with the metadata. For example, the file metadata.yaml might contain:

---
- :document_id: '12345'
  :account_id: 10
  :page_numbers:
  - 1
  - 2
  - 3
- :document_id: '12346'
  :account_id: 24
  :page_numbers:
  - 4
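
A file like this can be written from the same Ruby script that builds the PDF. The following is a minimal sketch using only Ruby's standard yaml library; the write_metadata helper is a hypothetical name, and the records simply mirror the example above.

```ruby
require 'yaml'

# Hypothetical helper: serialize per-document records to a YAML file.
# Each record lists the pages of the PDF that belong to one document.
def write_metadata(records, path)
  File.write(path, records.to_yaml)
end

records = [
  { :document_id => '12345', :account_id => 10, :page_numbers => [1, 2, 3] },
  { :document_id => '12346', :account_id => 24, :page_numbers => [4] }
]

write_metadata(records, 'metadata.yaml')
```

Because the hash keys are Ruby symbols, they are written with a leading colon, which is why the YAML keys above look like :document_id rather than document_id.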

After creating the PDF file with prawn, use pdftk to attach the metadata file to it:

$ pdftk foo.pdf attach_files metadata.yaml output foo-with-attachment.pdf

Since pdftk will not modify a file in place, the output file must be different from the input file.
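
If the PDF is generated from a Ruby script, the attach step can be run from that script as well. This is a sketch under assumptions, not a definitive implementation: pdftk_attach_command and attach_metadata are hypothetical helper names, and pdftk is assumed to be installed and on the PATH. The argument list is the same as in the command above.

```ruby
# Hypothetical helper: build the pdftk argument list for attaching a file.
# pdftk cannot modify a PDF in place, so identical input and output paths
# are rejected before the command is run.
def pdftk_attach_command(input_pdf, attachment, output_pdf)
  if File.expand_path(input_pdf) == File.expand_path(output_pdf)
    raise ArgumentError, 'output file must differ from input file'
  end
  ['pdftk', input_pdf, 'attach_files', attachment, 'output', output_pdf]
end

# Hypothetical helper: run pdftk; raises if it fails or is not installed.
def attach_metadata(input_pdf, attachment, output_pdf)
  command = pdftk_attach_command(input_pdf, attachment, output_pdf)
  system(*command) or raise "pdftk failed: #{command.join(' ')}"
end

# Usage, once foo.pdf and metadata.yaml exist:
#   attach_metadata('foo.pdf', 'metadata.yaml', 'foo-with-attachment.pdf')
```

Passing the arguments to system as separate strings avoids going through a shell, so file names containing spaces do not need quoting.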

You may be able to extract the metadata file using pdf-reader, but you can certainly do it with pdftk. This command unpacks metadata.yaml into the unpacked-attachments directory.

$ pdftk foo-with-attachment.pdf unpack_files output unpacked-attachments
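
Once the file is unpacked, a downstream tool can load it and look up the record that covers a given page, which gives per-page metadata without a page-piece dictionary. The following is a minimal sketch: load_metadata and record_for_page are hypothetical helper names, the file layout is assumed to match the metadata.yaml example above, and the permitted_classes keyword assumes a reasonably recent Ruby (2.6 or later).

```ruby
require 'yaml'

# Hypothetical helper: load the unpacked metadata file.
# Symbol must be permitted because the keys are written as Ruby symbols.
def load_metadata(path)
  YAML.safe_load(File.read(path), permitted_classes: [Symbol])
end

# Hypothetical helper: return the record whose :page_numbers include the
# given page, or nil if no record covers that page.
def record_for_page(records, page_number)
  records.find { |record| record[:page_numbers].include?(page_number) }
end

# Usage, after the pdftk unpack_files command above has run:
#   records = load_metadata('unpacked-attachments/metadata.yaml')
#   record = record_for_page(records, 4)
#   record[:document_id] if record
```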