How to find the trailer dictionary?

Posted 2020-07-16 02:59

Question:

Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says the xref can appear anywhere in the document, as long as the trailer still appears before the startxref. This makes sense until you have to parse it: because you have to parse in reverse, you can't take comments or strings into account. Let's get a little more wacky, then.

trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF

This is a perfectly valid trailer. The first one is the real trailer, but the second one is inside a "string". In this case, reverse parsing is going to fail to catch the comments. Looking for the string "trailer" is going to fail if it's a part of a comment or string. What is the best way of finding out where the trailer starts?

Update - This trailer seems to open in Acrobat Reader

%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n

1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj

trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF

As far as syntax goes, this conforms to the spec. Somehow Acrobat seems to know whether it is in a comment or in a string. Parsing left to right, the second trailer is inside a string with a % tacked on, followed by a comment after the trailer. But parsing right to left, you have no idea whether the first ) is part of a comment or the end of a string definition.

Another Example:

%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n

1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
  /Type /Catalog
  /Pages 6 0 R
  /OpenAction [ 7 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
  /Type /Pages
  /Kids [ 7 0 R ]
  /Count 1
>>
endobj
7 0 obj <<
  /Type /Page
  /Parent 6 0 R
  /Resources << >>
  /MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj

trailer<< %\
  /Size 8 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 8 %\
  /Root 5 0 R %\
  /Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
 17
%%EOF

This example is displayed correctly in Adobe. In my last case, you claimed it would fail because the "root" node is invalid. In this new sample the second root is valid, but it's never actually used. If the second trailer were the one being read, shouldn't it display a 100x100 window instead of the 8.5"x11" one?

In regard to the Resources

  (Required; inheritable) A dictionary containing any resources required by the page 
(see Section 3.7.2, “Resource Dictionaries”). If the page requires no resources, the 
value of this entry should be an empty dictionary. Omitting the entry entirely
indicates that the resources are to be inherited from an ancestor node in the page 
tree.

Answer 1:

The startxref statement is usually at the end of the file, with the trailer preceding it.

Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have several cross-reference sections, each with its own trailer)."

If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table byte-offset calculations. Therefore, parsing it correctly is not a problem.

To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):

"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"

The key expression to take into account here is "LAST cross-reference section".

If you have updated trailers in mind, then have a look at Section 7.5.6.

Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file, and it is followed by the last trailer. The second one to read is the last-but-one appearing in the file, with the last-but-one trailer, and so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read.
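
As a rough illustration of that reading order, here is a minimal Python sketch. It is not a conforming reader: it assumes classic (non-stream) xref tables, it assumes each trailer points to the previous section through the /Prev key listed in Table 15, and the helper names `last_startxref` and `xref_chain` are made up for this example.

```python
import re


def last_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def xref_chain(data: bytes) -> list[int]:
    """Offsets of the xref sections, newest first, following /Prev links."""
    offsets = []
    offset = last_startxref(data)
    while offset not in offsets:              # guard against /Prev loops
        offsets.append(offset)
        # Naive assumption: the first 'trailer' after the xref offset
        # is the trailer that belongs to this xref section.
        trailer_at = data.find(b"trailer", offset)
        if trailer_at < 0:
            break
        end = data.find(b"startxref", trailer_at)
        if end < 0:
            end = len(data)
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer_at:end])
        if prev is None:
            break                             # oldest section reached
        offset = int(prev.group(1))
    return offsets
```

Calling `xref_chain(open(path, "rb").read())` on an incrementally updated file should list the newest xref offset first and the original one last, as long as the assumptions above hold.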

Should you think of "comments" as something you can freely insert into the PDF without corrupting its structure, then think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).


Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all, therefore it's also not a valid trailer dictionary...

Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).

Jeremy seems to have constructed his example so that this snippet could be (mis-)understood as a potentially valid trailer dictionary:

trailer<<%) >>
% test test )

Which of course isn't a dictionary at all, since we don't see any key-value pair here.

His full example isn't valid either, because the key called /Key isn't among the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).

So in his PDF parsing code Jeremy should do what all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of trying to make sense of them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".



Answer 2:

Q: Doc, it hurts when I do this.
A: Don't do that.

The correct way to parse the end of a PDF goes something like this:

  1. Find the last startxref
  2. Back up to that byte offset and start parsing xref table entries
  3. After the last xref table, parse out the trailer.

You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
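
The skipping approach could look something like the sketch below. It only works if the file really does use exactly 20 bytes per entry, which the example in the question does not. The function name `find_trailer` and the `SUBSECTION` pattern are assumptions made for this illustration, not part of any library.

```python
import re

# A subsection header: "<first object number> <entry count>" on its own line.
SUBSECTION = re.compile(rb"\s*(\d+)[ \t]+(\d+)[ \t]*(?:\r\n|\r|\n)")


def find_trailer(data: bytes, xref_offset: int) -> int:
    """Return the offset of the 'trailer' keyword that follows the xref table."""
    if not data.startswith(b"xref", xref_offset):
        raise ValueError("startxref does not point at an xref keyword")
    pos = xref_offset + len(b"xref")
    while True:
        header = SUBSECTION.match(data, pos)
        if header is None:
            break                              # no more subsections
        count = int(header.group(2))
        pos = header.end() + 20 * count        # skip N entries of 20 bytes each
    while pos < len(data) and data[pos:pos + 1].isspace():
        pos += 1
    if not data.startswith(b"trailer", pos):
        raise ValueError("expected 'trailer' after the xref subsections")
    return pos
```

The offset passed in as `xref_offset` is the number read after the last startxref. If the function raises, the xref table is probably malformed and a more forgiving fallback is needed.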

So why on Earth do you just want the trailer?


When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.

What I DID find, was this:

A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...

There are bullets in front of these sections, not numbers.

All of that hints that a conforming PDF could get away with swapping the order of the body and xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.

But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.

I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).

I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.

So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.


And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.

But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.



Answer 3:

Jeremy has repeatedly edited his question and example code. This made my original answer and some of my original comments partially invalid and missing the point.

The fact is (and it is a well-known one among people in the prepress trade and industry): in quite a few instances, Adobe silently and without warning processes and displays PDF files which do not pass a strict validity checker.

Jeremy seems to have constructed such a case. His latest example would make any PDF parser interpret the following snippet as being the trailer (I stripped comments):

trailer<<
  /Size 4
  /Root 2 0 R
  /Info 3 0 R
>>

However, taking the info in this trailer will lead to the parser looking for the /Root at object 2 (while object 2 in fact is of /Type /Pages when it should be of /Type /Catalog for being the root object).

As a consequence, the PDF interpreter would have to

  • (a) either continue searching for another instance of a trailer on the chance that the next one does contain legitimate PDF info,
  • (b) or give up on processing the file and throw an error.

Adobe seems to follow alternative (a).

Ghostscript seems to follow alternative (b).


Note that, according to my byte counting, Jeremy's PDF example has one more problem: its xref table is invalid. It has only 16 bytes per line instead of 20. From the PDF spec document:

[....] the cross-reference entries themselves, one per line. Each entry shall be exactly 20 bytes long, including the end-of-line marker. There are two kinds of cross-reference entries: one for objects that are in use and another for objects that have been deleted and therefore are free. Both types of entries have similar basic formats, distinguished by the keyword n (for an in-use entry) or f (for a free entry). The format of an in-use entry shall be:

nnnnnnnnnn ggggg n eol

where:

nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence

The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.

So to make Jeremy's xref table a valid one, it should be padded with 2 more leading '0' and read:

xref
0 4
0000000000 65535 f 
0000000110 00000 n 
0000000250 00000 n 
0000000315 00000 n 
0000000576 00000 n 

However, adding these 2 '0' characters to each xref line also shifts each object by 10 more bytes, so the nnnnnnnnnn figures should be corrected as well (being lazy, I didn't do it).
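
To make the 20-byte rule concrete, here is a small Python sketch that builds and checks entries in the format quoted above. The names `format_entry` and `is_valid_entry` are invented for this example. The choice of a space followed by a line feed as the 2-character end-of-line sequence is one of the variants the spec allows (the others being space plus carriage return, or carriage return plus line feed).

```python
import re

# 10-digit offset, 5-digit generation, keyword, 2-character end-of-line.
ENTRY = re.compile(rb"\d{10} \d{5} [nf](?: \r| \n|\r\n)")


def format_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Build one 20-byte cross-reference entry, using SP LF as the EOL."""
    keyword = b"n" if in_use else b"f"
    return b"%010d %05d %s \n" % (offset, generation, keyword)


def is_valid_entry(line: bytes) -> bool:
    """True if the line is exactly 20 bytes long and matches the entry layout."""
    return len(line) == 20 and ENTRY.fullmatch(line) is not None
```

With this, `format_entry(110)` gives `b"0000000110 00000 n \n"`, while an entry from the question such as `b"00000110 00000 n\n"` is rejected by `is_valid_entry`.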

So Acrobat did open Jeremy's constructed file (without any warning)

  • (1) despite the invalid trailer definition, and
  • (2) despite the glaringly non-compliant xref table.

This adds two more proofs to what I stated in my second paragraph: Adobe's PDF parsing accepts files which violate Adobe's own PDF standard.

This is unfortunate. It lets lazy developers who write sloppy code that emits non-compliant PDF files get away without punishment. The fact that Adobe doesn't outright reject such crappy files may be in the interest of "user friendliness", but it promotes violations of the standard. At the very least, Adobe should always issue warnings when encountering such stuff.

Since Jeremy seems to be writing a PDF parser that aims to cover all corner cases, his users should hope that it at least warns them when it encounters shitty PDFs.

In any case: I've seen a lot of non-compliant PDF files emitted by crappy PDF generators. But so far I have never encountered one that had comments sprinkled into its trailer section. So covering corner cases should probably start with lower-hanging fruit than this.



Answer 4:

I think I have found the solution. After extensive testing with Adobe, I have found that what Adobe does is find the last known construct that can be parsed, and work forward from there. Then it finds the last trailer that can be parsed correctly. So even if there is a correct root node in a trailer that comes before the last parseable trailer, it will still fail if the root in that last trailer is invalid.

It is also worth noting that this is still token-based forward parsing: trailers between ( and ) are ignored, and so are trailers between stream and endstream, unless that stream has an invalid length or a length specified in an obj after the stream (as these objects are not specified in the xref table). Adobe seems to take it an extra step further by also finding trailers in "gaps" in the xref table. This doesn't conform to the current spec model, where the trailer is found at the end and not in the body or the xref table.

So what I think is the best model is to take the largest object offset listed in the xref table and the location of the xref table itself. If the xref table comes after the largest object offset, use the xref table's location and work forward from there. This will allow me to parse strings and comments correctly without worrying. Thanks for everyone's help in this matter. Hopefully this helps people build a more robust PDF parser as well.
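
For anyone who wants to try the forward, token-based idea, here is a minimal Python sketch of the string and comment handling only. It is an assumption-laden illustration, not what Adobe actually does: it ignores stream bodies and hex strings, it does not check that "trailer" is a standalone keyword, and the function name `find_trailers` is made up for this example.

```python
def find_trailers(data: bytes, start: int = 0) -> list[int]:
    """Offsets of 'trailer' found outside comments and literal strings."""
    found = []
    i, n = start, len(data)
    while i < n:
        c = data[i:i + 1]
        if c == b"%":
            # Comment: runs to the end of the line, backslashes mean nothing.
            while i < n and data[i:i + 1] not in (b"\r", b"\n"):
                i += 1
        elif c == b"(":
            # Literal string: balanced parentheses, backslash escapes one byte.
            depth = 1
            i += 1
            while i < n and depth > 0:
                c = data[i:i + 1]
                if c == b"\\":
                    i += 1                     # step onto the escaped byte
                elif c == b"(":
                    depth += 1
                elif c == b")":
                    depth -= 1
                i += 1                         # step past the current byte
        elif data.startswith(b"trailer", i):
            found.append(i)
            i += len(b"trailer")
        else:
            i += 1
    return found
```

Run from the start of the trailer in the question's sample, this reports only the first trailer, because the second one sits inside the /Key string. The `start` argument would be the position chosen by the model described above.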



Answer 5:

The trailer dictionary follows the xref section. Based on the startxref value, you jump to the beginning of the xref section. After you read the xref section, you will reach the trailer dictionary. The trailer keyword is always the first on its line (whitespace is allowed in front of it). PDF files allow incremental updates, so you can encounter PDF files with multiple xref sections and trailers, but the processing rule is the same: first process the xref section and then the trailer. If the file includes incremental updates, the trailer will include a reference to the previous xref section.



Tags: parsing, pdf