PDF specifications
First of all some words on the documents specifying the PDF format.
Originally PDF was a proprietary Adobe format. They published the PDF References from early on enabling other companies to create and process PDF files but they made it clear that they did not consider the references normative in nature; according to Leonard Rosenthol, Adobe's PDF evangelist:
the PDF References aren't "normative" in nature - they don't (usually) make final, definitive statements - just sort of general ones.
(in an answer on the iText mailing list made December 15th, 2008)
In 2008, though, Adobe eventually gave the PDF format into the hands of ISO to have it become actually standardized as ISO 32000-1.
When PDF became an ISO standard, a new feature was added to it: people could define "Extensions" to the PDF format and got a standardized way to document in a PDF that such extensions were used in it, cf. ISO 32000-1 section 7.12.
The most notable extensions are those published by Adobe themselves and the PAdES extensions by ETSI, cf. ETSI EN 319 142-1 V1.1.1.
Last year, in 2017, the PDF 2.0 specification, ISO 32000-2, was published.
Your questions
1. P value in FieldMDP transform dictionary
As per page 736 of the spec, FieldMDP has no P parameter, that belongs to DocMDP (page 733). Of course, the PDF might have been modified by some third party who added an extraneous key to the dictionary. But I just want to confirm, if a P key is found in FieldMDP, is it to be ignored, or it has some meaning?
This is a tough one...
First of all please have a look at the "Signature Fields" section under 8.6.3 Field Types on page 696 and 697, in particular the Lock entry in table 8.81 the value of which is a dictionary with the values in table 8.82.
If such a Lock entry exists in a yet-unsigned signature field, the Lock dictionary during signing essentially is copied (with updated Type and V values) to become the FieldMDP transform parameters dictionary explained on page 736 in table 8.106.
Now the Adobe supplement to ISO 32000, extension level 3, adds a new entry to the lock dictionary entries in table 8.82, the P entry! This entry upon filling in the signature field in question can be used to decrease the original P level of a previous DocMDP signature.
By the above mentioned mechanism of using a copy of the Lock dictionary as FieldMDP transform parameters dictionary, the P value now finds itself among the FieldMDP transform parameters.
Unfortunately it was forgotten in that supplement to also update the transform parameter specification...
So NO, you cannot ignore the P entry (if you want to handle DocMDP and FieldMDP transformations at all, that is).
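To illustrate the copy mechanism described above, here is a minimal sketch that models PDF dictionaries as plain Python dicts. The function name is made up for this example, and the default V value of 1.2 is an assumption taken from my reading of table 8.106; please check it against your copy of the Reference.

```python
def transform_params_from_lock(lock: dict, version: str = "1.2") -> dict:
    """Derive FieldMDP transform parameters from a signature field Lock dictionary.

    Hypothetical helper: dictionaries are plain Python dicts here, and the
    default version "1.2" is an assumption, not something this sketch verifies.
    """
    params = dict(lock)                 # copies Action, Fields and, if present, P
    params["Type"] = "TransformParams"  # updated Type value
    params["V"] = version               # updated V value
    return params


# A Lock dictionary using the P entry added by the Adobe supplement.
lock = {"Type": "SigFieldLock", "Action": "Include",
        "Fields": ["Name", "Address"], "P": 1}
params = transform_params_from_lock(lock)
print(params)
```

The point of the sketch is merely that P travels along with Action and Fields, which is why a validator will meet it among the FieldMDP transform parameters.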
2. & 4. byte range array vs. recursive object digest
As per page 725 of the spec, there are two reproducible ways to compute digest of the content - via byte range array, or via recursive object digest computation, as prescribed in the TransformParams entry in the signature reference dictionary. Question is, in this document, both are present. What is the purpose of the FieldMDP entry at all?
...
What is the purpose of the DigestLocation and DigestValue key-value pairs in the FieldMDP dictionary? The digest value is already provided in the Contents key of the V dictionary, right? Plus this is present within an array (Reference), here there is only one entry, what if there are multiple entries?
First of all have a look at the Errata for the PDF Reference, sixth edition, version 1.7:
Page 725
Add the following paragraph after the third paragraph
PDF 1.5 specified a method for computing an object digest over a subtree of objects in memory and storing the resulting digest in entries named DigestValue and DigestLocation in the signature reference dictionary. (The digest was documented in Appendix I, "Computation of Object Digests.") This method is deprecated and should not be used. All mentions of objects digests in section 8.7, "Digital Signatures", should be disregarded.
Your document is from a time when there still were older Adobe Reader versions in use, so for compatibility object digests are present but also the preferred combination of byte range digest and FieldMDP transform information.
In ISO 32000-1 there are no mentions of the object digest anymore.
3. Signature Contents and SubFilter
If I understand correctly, the value of the Contents key is the encrypted digest of the content excluding the content value stream (itself). So I first have to decrypt it. Then I have to compute the digest of the actual content excluding the content stream, and compare the two digests to see if they match. Is that right? If so, how do I do that? I suspect the Filter and SubFilter keys denote the methods, but I am unable to understand how exactly.
For this you should read section 8.7.2 "Signature Interoperability" in the PDF Reference. But please be aware that the SubFilter values adbe.x509.rsa_sha1 and adbe.pkcs7.sha1 (and the associated mechanisms) have been deprecated with PDF 2.0, while ETSI.CAdES.detached and ETSI.RFC3161 have been added; the specification of the latter two can also be found in ETSI EN 319 142-1 V1.1.1.
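A validator might keep track of these SubFilter values in a small lookup table. The following is only a sketch: the table layout and the function name are made up for this example, and it records nothing beyond the deprecation status mentioned above.

```python
# SubFilter name -> deprecated with PDF 2.0? (status as described above)
SUBFILTER_DEPRECATED = {
    "adbe.x509.rsa_sha1": True,
    "adbe.pkcs7.sha1": True,
    "adbe.pkcs7.detached": False,
    "ETSI.CAdES.detached": False,   # added with PDF 2.0 / ETSI EN 319 142-1
    "ETSI.RFC3161": False,          # added with PDF 2.0 / ETSI EN 319 142-1
}


def subfilter_status(name: str) -> str:
    """Return 'current', 'deprecated' or 'unknown' for a SubFilter name."""
    if name not in SUBFILTER_DEPRECATED:
        return "unknown"
    return "deprecated" if SUBFILTER_DEPRECATED[name] else "current"


print(subfilter_status("adbe.pkcs7.detached"))
```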
5. Digests
As per page 1131 of the spec, the digest length is 16 bytes or 20 bytes. How can a huge PDF be squeezed into such a small digest? Isn't 16 too small a number to guarantee that no two different PDFs will have the same digest?
Obviously cryptographic hash functions can never guarantee that no collisions happen. To be considered good, though, they must be able to claim that such a collision is very improbable and that constructing a collision is difficult.
Random collisions are very improbable for the MD5 and SHA-1 algorithm with their 16 or 20 bytes. That is not the issue.
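To get a feeling for how improbable random collisions are, one can evaluate the usual birthday-bound approximation. This is a rough sketch which assumes the hash output behaves like a uniformly random value; the function name and the number of documents are arbitrary choices for this example.

```python
import math


def collision_probability(num_documents: int, digest_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_documents * (num_documents - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision for the tiny probabilities involved here.
    return -math.expm1(exponent)


# Probability that any two out of a trillion random documents collide.
for name, bits in (("MD5", 128), ("SHA-1", 160)):
    print(f"{name}: {collision_probability(10 ** 12, bits):.3e}")
```

This only covers accidental collisions, not collisions constructed on purpose.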
The problem is that both are by now considered insecure with respect to the difficulty of deliberately constructing collisions.
Concerning Appendix I, the section you found this in, though, the Errata say:
Page 1131
Add the following sentence after the first sentence in this appendix:
This method for detecting modifications is deprecated and should not be used. Additionally, the description of the algorithm is known to contain significant errors.
In particular in the light of the latter sentence it makes no sense to even try implementing object digest calculation.
6. Invisible signatures
I understand that the digital signature in the said PDF is through a signature field. Is it possible for a digital signature to not be a signature field, that is, no associated Rect entry in the Fields array entry?
These actually are two questions:
a Signatures can be invisible, i.e. have no in-document visualization. Already the old Reference you use says so in the "Signature Fields" section under 8.6.3 Field Types on page 696:
The annotation rectangle (Rect) in such a dictionary gives the position of the field on its page. Signature fields that are not intended to be visible should have an annotation rectangle that has zero height and width.
b There even are signatures which are not the value of some signature field, the usage rights signatures, cf. section 8.7 Digital Signatures in your Reference on page 726:
At most two usage rights signatures (PDF 1.5). Its signature dictionary is referenced from the UR or UR3 (PDF 1.6) entry in the permissions dictionary (not from a signature field);
7. the Filter key
What exactly is the role of the Filter key? I can see that the SubFilter key determines which scheme to use while decrypting the content, what does the Filter key signify? The spec says that it is a signature handler. What exactly is that? What does it additionally say that the SubFilter value does not?
Indeed nowadays the Filter entry has become meaningless in interoperable PDF signature processing.
In the early days of PDF signatures, though, the filter was indeed important: it represented a handler, a module of the PDF Reader which you needed to process the signature in any way. Some such handlers were installed with the Reader, others you had to install separately. Different signature handlers supported completely different mechanisms.
But as time went by certain mechanisms turned out to be in general use and others not, and the Adobe handler started supporting all these generally used mechanisms. These standard mechanisms are those you find in the section 8.7.2 Signature Interoperability.
At the same time, PDF signing software products from vendors other than Adobe started to use the Adobe handler name instead of a handler identifier of their own.
Thus, nowadays one can generally simply ignore the Filter value.
A. How to verify?
In a comment you asked
What do I do with the MDP information in the above PDF? I have the ByteRange object, from which I can compute the total byte content of the PDF excluding the Contents key. I calculate its hash via MD5, then decrypt the Contents value with the subfilter info. Then I compare. Where exactly do I need the MDP info for? And secondly, the Reference dictionary is optional, if it were optional, how would I know that I need to apply MD5 to compute the hash of the PDF content?
What you have to do...
Check whether the ByteRange entry describes the whole file except the Contents value. Cf. page 740, "For byte range signatures, Contents is a hexadecimal string with “<” and “>” delimiters. It must fit precisely in the space between the ranges specified by ByteRange."
The SubFilter is adbe.pkcs7.detached, thus the contents of the Contents value are a PKCS#7 signature container. Parse this PKCS#7 container and determine the hash algorithm used in the single SignerInfo object in it. This is SHA-1 here.
Calculate the SHA-1 hash of the ranges described by the ByteRange value.
Compare that hash value with the value of the signed messageDigest attribute of the SignerInfo object in the PKCS#7 container.
If these hashes do not match, the data has been manipulated since signing.
Determine the signer certificate of the SignerInfo signature.
Which criteria you have to use here may depend on the technical and legal context in which the signature has been generated. E.g. in modern contexts ESS attributes have to exist and match.
Verify whether the signed attributes in the SignerInfo object are correctly signed by the signature value therein.
If they are not, the PKCS#7 container has been manipulated since signing.
Verify whether the signer certificate was valid at the appropriate validation time.
This is even more specific to PKI policy and legal framework. E.g. is the appropriate validation time the current time or the best time determined as signing time? Is the signed signing-time attribute / the PDF signature dictionary M value trustworthy or must there be a proof of existence? Which validation model shall be used for the certificate chain? What are your accepted trust anchors? What about revocation information...
Your signature has field MDP and in it a stricter doc MDP value. Thus you have to check whether there are any incremental updates after the signed revision, and if there are, whether they contain disallowed changes to the document content.
I might have forgotten some checks...
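The hash-related part of these checks (the ByteRange check, the hash calculation, and the comparison with the messageDigest attribute) can be sketched as follows. This is only an illustration under assumptions: it expects a ByteRange with exactly two ranges starting at offset 0, it takes the expected digest and the algorithm name as inputs (in reality both come from parsing the signature container), and it does not verify the signature value or the certificate. All function names are made up for this example.

```python
import hashlib


def signed_bytes(pdf: bytes, byte_range: list[int]) -> bytes:
    """Return the concatenated ranges of a ByteRange [off1 len1 off2 len2]."""
    if len(byte_range) != 4 or byte_range[0] != 0:
        raise ValueError("expected a ByteRange of the form [0 len1 off2 len2]")
    off1, len1, off2, len2 = byte_range
    if off2 + len2 > len(pdf):
        raise ValueError("ByteRange exceeds the file size")
    # The gap between the ranges must be exactly the hex string of Contents.
    gap = pdf[off1 + len1:off2]
    if not (gap.startswith(b"<") and gap.endswith(b">")):
        raise ValueError("gap between the ranges is not a hex string")
    return pdf[off1:off1 + len1] + pdf[off2:off2 + len2]


def byte_range_digest(pdf: bytes, byte_range: list[int],
                      algorithm: str = "sha1") -> bytes:
    """Hash the signed ranges with the algorithm named in the SignerInfo."""
    return hashlib.new(algorithm, signed_bytes(pdf, byte_range)).digest()


def digest_matches(pdf: bytes, byte_range: list[int],
                   message_digest: bytes, algorithm: str = "sha1") -> bool:
    """Compare against the signed messageDigest attribute value."""
    return byte_range_digest(pdf, byte_range, algorithm) == message_digest


# Toy data standing in for a PDF: two signed parts around a hex string.
toy = b"AAAA<00ff>BBBB"
print(digest_matches(toy, [0, 4, 10, 4], hashlib.sha1(b"AAAABBBB").digest()))
```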
How do I check if incremental updates have been made to the PDF since the signer signed the document? I know with each incremental update a new trailer and xref entry is added, but how do I know which ones were part of the original content when signed and which ones have been added since then?
Ah, ok, I see I have to change the description of verification step 1 above ("Check whether the ByteRange entry describes the whole file except the Contents value"). Actually you have to check whether the complete range from the start of the file to the end of the higher byte range part constitutes a valid PDF file containing the signature (and, of course, that the byte range gap is the Contents value).
If the actual PDF is larger, you can assume that the originally signed PDF is the identified starting section and that everything thereafter are incremental updates to the signed file.
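Splitting the file into the signed revision and later additions can be sketched like this. The function name is made up, a ByteRange with two ranges is assumed, and the end-of-file marker test is merely a necessary condition; it does not prove that the starting section is a valid PDF.

```python
def split_signed_revision(pdf: bytes, byte_range: list[int]) -> tuple[bytes, bytes]:
    """Split a file into the signed revision and everything appended later.

    Hypothetical helper assuming a ByteRange of the form [0 len1 off2 len2].
    """
    off1, len1, off2, len2 = byte_range
    end = off2 + len2
    if off1 != 0 or end > len(pdf):
        raise ValueError("ByteRange does not describe a starting section of the file")
    signed, updates = pdf[:end], pdf[end:]
    # Necessary but not sufficient: a revision ends with an end-of-file marker.
    if not signed.rstrip(b"\r\n").endswith(b"%%EOF"):
        raise ValueError("signed range does not end with an %%EOF marker")
    return signed, updates


# Toy data: a "signed" revision followed by one appended update.
base = b"%PDF-1.7\nsig <00ff> rest\n%%EOF\n"
gap_end = base.index(b">") + 1
toy_range = [0, base.index(b"<"), gap_end, len(base) - gap_end]
signed, updates = split_signed_revision(base + b"more\n%%EOF\n", toy_range)
print(len(signed), len(updates))
```

If the second part is empty, nothing has been appended after signing; otherwise its contents are what you have to inspect for disallowed changes.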
And how did you conclude that the hashing algorithm here is SHA1? Doesn't the reference dictionary mention MD5? The subfilter is adbe.pkcs7.detached, how do you know that the hashing algo is not, for example, SHA256? If it was adbe.pkcs7.sha1, I would understand.
Yes, the reference dictionary mentions MD5 but this has nothing to do with the byte range signature. It refers to the object digest which is deprecated (see answer to 2. and 4. above) and for which the only description for its calculation is known to contain significant errors anyways (see answer to 5. above).
As mentioned above in verification step 2, you have to parse this PKCS#7 container and determine the hash algorithm. At the bottom of page 738 of the reference you'll read that "the PKCS#7 object must conform to the PKCS#7 specification in Internet RFC 2315." The newer ISO norms instead reference the Cryptographic Message Syntax (CMS) according to RFC 3852 or RFC 5652, which reflect the further development stages of PKCS#7 containers.
When you extract the signature container and analyze its structure according to the RFCs mentioned above, you'll find that its SignerInfo part (parsed using an ASN.1 dump utility) starts like this:
3867 6086: . . . . SEQUENCE {
3871 1: . . . . . INTEGER 1
3874 75: . . . . . SEQUENCE {
3876 69: . . . . . . SEQUENCE {
3878 11: . . . . . . . SET {
3880 9: . . . . . . . . SEQUENCE {
3882 3: . . . . . . . . . OBJECT IDENTIFIER countryName (2 5 4 6)
: . . . . . . . . . . (X.520 DN component)
3887 2: . . . . . . . . . PrintableString 'US'
: . . . . . . . . . }
: . . . . . . . . }
3891 22: . . . . . . . SET {
3893 20: . . . . . . . . SEQUENCE {
3895 3: . . . . . . . . . OBJECT IDENTIFIER organizationName (2 5 4 10)
: . . . . . . . . . . (X.520 DN component)
3900 13: . . . . . . . . . PrintableString 'GeoTrust Inc.'
: . . . . . . . . . }
: . . . . . . . . }
3915 30: . . . . . . . SET {
3917 28: . . . . . . . . SEQUENCE {
3919 3: . . . . . . . . . OBJECT IDENTIFIER commonName (2 5 4 3)
: . . . . . . . . . . (X.520 DN component)
3924 21: . . . . . . . . . PrintableString 'GeoTrust CA for Adobe'
: . . . . . . . . . }
: . . . . . . . . }
: . . . . . . . }
3947 2: . . . . . . INTEGER 514
: . . . . . . }
3951 9: . . . . . SEQUENCE {
3953 5: . . . . . . OBJECT IDENTIFIER sha1 (1 3 14 3 2 26)
: . . . . . . . (OIW)
3960 0: . . . . . . NULL
: . . . . . . }
3962 1733: . . . . . [0] {
3966 24: . . . . . . SEQUENCE {
3968 9: . . . . . . . OBJECT IDENTIFIER contentType (1 2 840 113549 1 9 3)
: . . . . . . . . (PKCS #9)
3979 11: . . . . . . . SET {
3981 9: . . . . . . . . OBJECT IDENTIFIER data (1 2 840 113549 1 7 1)
: . . . . . . . . . (PKCS #7)
: . . . . . . . . }
: . . . . . . . }
3992 35: . . . . . . SEQUENCE {
3994 9: . . . . . . . OBJECT IDENTIFIER messageDigest (1 2 840 113549 1 9 4)
: . . . . . . . . (PKCS #9)
4005 22: . . . . . . . SET {
4007 20: . . . . . . . . OCTET STRING
: . . . . . . . . . 3F 00 47 E6 CB 5B 9B B0 ?.G..[..
: . . . . . . . . . 89 25 4B 20 D1 74 44 5C .%K .tD\
: . . . . . . . . . 3B A4 F5 13 ;...
: . . . . . . . . }
: . . . . . . . }
Here you can recognize the CMSVersion INTEGER 1, the SignerIdentifier with subject cn=GeoTrust CA for Adobe, o=GeoTrust Inc., c=US and serial number INTEGER 514, and then the DigestAlgorithmIdentifier for sha1 1 3 14 3 2 26. This is where you get the digest algorithm to apply to the byte ranges.
Thereafter you see the start of the signed attributes, among them the messageDigest 3F 00 47 E6 CB 5B 9B B0 89 25 4B 20 D1 74 44 5C 3B A4 F5 13. This is the value you compare with in verification step 4.
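If you parse the container yourself instead of using a dump utility, the digest algorithm arrives as DER-encoded OBJECT IDENTIFIER content bytes. The following sketch decodes such bytes and maps the result to a hashlib algorithm name. The function and table names are made up for this example, and the table lists only a few well-known digest OIDs; a real implementation would rather rely on an ASN.1/CMS library.

```python
# Dotted OID -> hashlib algorithm name (deliberately incomplete).
DIGEST_OIDS = {
    "1.3.14.3.2.26": "sha1",
    "2.16.840.1.101.3.4.2.1": "sha256",
    "1.2.840.113549.2.5": "md5",
}


def decode_oid(content: bytes) -> str:
    """Decode the content bytes of a DER OBJECT IDENTIFIER to dotted form."""
    subids, value = [], 0
    for byte in content:
        value = (value << 7) | (byte & 0x7F)   # base-128, high bit = "more"
        if not byte & 0x80:
            subids.append(value)
            value = 0
    if not subids:
        raise ValueError("empty OBJECT IDENTIFIER")
    first = min(subids[0] // 40, 2)            # first two arcs share one value
    arcs = [first, subids[0] - 40 * first] + subids[1:]
    return ".".join(map(str, arcs))


# The five content bytes of the OBJECT IDENTIFIER at offset 3953 in the dump.
oid = decode_oid(bytes.fromhex("2B0E03021A"))
print(oid, DIGEST_OIDS.get(oid, "unknown"))
```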
So if we exclude the MDP checks, the basic signature validation will involve checking just the byte range entry corresponding to the catalog and comparing it against the hash value, right?
To clarify: The mathematical signature validation will involve calculating the hash value of the byte range and comparing it to the hash value stored in the CMS container (verification steps 2..4) AND verifying whether the signature value in the CMS SignerInfo object properly signs the signed attributes therein (which usually means calculating the hash of the signed attributes, determining the signer certificate, and checking the signature value against the hash value and the public key in the signer certificate) (verification steps 5 and 6).
Subsequent changes will go to the incremental update section if I am not wrong. If a strict DocMDP is enforced, any incremental update will invalidate the signature. Is that accurate?
A DocMDP P value of 1 allows no incremental updates at all. EXCEPT, that is, if your validation policy shall include the ETSI extensions or PDF 2.0: In those cases incremental updates which include only the data necessary to add Document Security Stores (DSS) and/or document timestamps to the document are allowed.
On backward compatibility
In comments you stressed that you will have to support all meanwhile deprecated mechanisms, too, for backward compatibility because "Too many PDFs are out there which are from old times."
Even if that were important, I would propose to at least implement the currently used mechanisms first and only thereafter attempt to implement object digests etc.
But actually merely a very small part of those old signatures can still meaningfully be validated because usually they don't come with the required validation related information (CRLs, OCSP responses) and proofs of existence (timestamps) to seriously come up with positive validation results.