Extract Editable Fields from a PDF in Objective-C

Posted 2019-04-08 10:18

Question:

I've been researching working with PDFs in my iOS app for a while now. I've figured out a few pieces of the puzzle like scanning for operators and displaying the PDF in a UIWebView. However, what I really need to do is identify editable fields within a PDF document.

Ideally I would like to interact with the fields directly, but that sounds very difficult and is not an obvious first step. I already interface with a Windows service that can manipulate PDFs in this way, so I could settle for identifying the editable fields, gathering the field data from the user in a form view, and POSTing that data back to the server. The problem is that I can't see how to identify the fields. I'm working with government-issued PDFs such as I-9s and W-4s, so I have no control over how the PDFs are created or how the fields are named. That is why I need to extract them dynamically. Any help and/or references would be appreciated.

I'm using [this reference](https://developer.apple.com/library/mac/#documentation/graphicsimaging/conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html "PDF Document Parsing") from Apple's Quartz 2D Programming Guide to trigger operator callbacks when scanning a PDF, but that isn't helping me find the editable fields.

I'm also simply loading a UIWebView with the PDF data to display to the user.

[_webView loadData:decodedData MIMEType:@"application/pdf" textEncodingName:@"utf-8" baseURL:nil];

UPDATE:

I built a PDF Helper class (shown below) to traverse all possible object types in the catalog. Originally I was not handling nested dictionaries within arrays, so I was not seeing the form fields. Once I fixed that, I realized there were parent references that I had to account for to avoid circular recursive calls that would cause an infinite loop. The code below logs a wealth of information from the document catalog. Now I just need to parse it to isolate the form fields I need.

PDFHelper.h

#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h> // declares CGPDFArrayRef and CGPDFDictionaryRef used below

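// Instance pointer read by the C callback; static avoids duplicate-symbol link errors if this header is imported by more than one file
static id selfClass;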

@interface PDFHelper : NSObject

@property (nonatomic, strong) NSData *pdfData;
@property (nonatomic, strong) NSMutableDictionary *pdfDict;
@property (nonatomic) int catalogLevel;


-(NSArray *) copyPDFArray:(CGPDFArrayRef)arr referencingDictionary:(CGPDFDictionaryRef)dict referencingKey:(const char *)key;
-(NSArray *) getFormFields;
-(CGPDFDictionaryRef) getDocumentCatalog;

@end

PDFHelper.m

#import "PDFHelper.h"
#import "FileHelpers.h"
#import "Log.h"

@implementation PDFHelper

@synthesize pdfData = _pdfData;
@synthesize pdfDict = _pdfDict;
@synthesize catalogLevel = _catalogLevel;

-(id)init
{
    self = [super init];
    if(self)
    {
        selfClass = self;
        _pdfDict = [[NSMutableDictionary alloc] init];
        _catalogLevel = 1;
    }

    return self;
}

-(NSArray *) getFormFields
{
    CGPDFDictionaryRef acroForm = NULL;
    if (CGPDFDictionaryGetDictionary([self getPdfDocDictionary], "AcroForm", &acroForm))
        CGPDFDictionaryApplyFunction(acroForm, getDictionaryObjects, acroForm);
    return [_pdfDict objectForKey:@"XFA"];
}

-(CGPDFDictionaryRef) getDocumentCatalog
{
    CGPDFDictionaryRef docCatalog = [self getPdfDocDictionary];
    CGPDFDictionaryApplyFunction(docCatalog, getDictionaryObjects, docCatalog);
    return docCatalog;
}

-(CGPDFDictionaryRef) getPdfDocDictionary
{
    NSURL *pdf = [[NSURL alloc] initFileURLWithPath:[FileHelpers pathInLibraryDirectory:@"file.pdf"]];

    [_pdfData writeToFile:[pdf path] atomically:YES];

    CGPDFDocumentRef pdfDocument = CGPDFDocumentCreateWithURL((__bridge CFURLRef)pdf);
    CGPDFDictionaryRef returnDict = CGPDFDocumentGetCatalog(pdfDocument);
    return returnDict;
}

void getDictionaryObjects (const char *key, CGPDFObjectRef object, void *info) {

    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"key: %s", key]];
    for (int i = 0; i < [selfClass catalogLevel]; i++)
        logString = [NSString stringWithFormat:@"-%@", logString];
    [Log LogDebug:logString];

    CGPDFDictionaryRef contentDict = (CGPDFDictionaryRef)info;

    CGPDFObjectType type = CGPDFObjectGetType(object);
    switch (type) {
        case kCGPDFObjectTypeNull: {            
                [Log LogDebug:[NSString stringWithFormat:@"*****pdf null value"]];
            break;
        }
        case kCGPDFObjectTypeBoolean: {
            CGPDFBoolean objectBoolean;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeBoolean, &objectBoolean)) {
                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf boolean value: %@", [NSNumber numberWithBool:objectBoolean]]];
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];
                [[selfClass pdfDict] setObject:[NSNumber numberWithBool:objectBoolean]
                                        forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
            }
            break;
        }
        case kCGPDFObjectTypeInteger: {
            CGPDFInteger objectInteger;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeInteger, &objectInteger)) {
                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf integer value: %ld", (long int)objectInteger]];
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];
                // CGPDFInteger is a long, so numberWithLong: avoids truncation on 64-bit devices
                [[selfClass pdfDict] setObject:[NSNumber numberWithLong:objectInteger]
                                        forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
            }
            break;
        }
        case kCGPDFObjectTypeReal: {
            CGPDFReal objectReal;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeReal, &objectReal)) {
                // Log and store the real as a floating-point value; casting to an integer dropped the fractional part
                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf real value: %f", (double)objectReal]];
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];
                [[selfClass pdfDict] setObject:[NSNumber numberWithDouble:objectReal]
                                        forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
            }
            break;
        }
        case kCGPDFObjectTypeName: {
            const char *name;
            if (CGPDFDictionaryGetName(contentDict, key, &name))
            {
                NSString *dictName = [[NSString alloc] initWithCString:name encoding:NSUTF8StringEncoding];
                if (dictName)
                {
                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf name value: %@", dictName]];
                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];
                    [[selfClass pdfDict] setObject:dictName
                                            forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
                }
            }
            break;
        }
        case kCGPDFObjectTypeString: {
            CGPDFStringRef objectString;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeString, &objectString)) {
                // CGPDFStringCopyTextString returns a +1 reference (or NULL); copy once and hand ownership to ARC
                NSString *textString = CFBridgingRelease(CGPDFStringCopyTextString(objectString));
                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf string value: %@", textString]];
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];
                if (textString) // setObject:forKey: raises on nil
                    [[selfClass pdfDict] setObject:textString
                                            forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
            }
            break;
        }
        case kCGPDFObjectTypeArray: {
            CGPDFArrayRef objectArray;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeArray, &objectArray)) {
                NSArray *myArray=[selfClass copyPDFArray:objectArray referencingDictionary:contentDict referencingKey:key];
                [[selfClass pdfDict] setObject:myArray
                                        forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];

            }
            break;
        }
        case kCGPDFObjectTypeDictionary: {
            CGPDFDictionaryRef objectDictionary;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeDictionary, &objectDictionary)) {
                NSString *logString = @"Found dictionary";
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                //[Log LogDebug:logString];
                NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
                if (![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
                {
                    [selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
                    CGPDFDictionaryApplyFunction(objectDictionary, getDictionaryObjects, objectDictionary);
                    [selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
                }
            }
            break;
        }
        case kCGPDFObjectTypeStream: {
            CGPDFStreamRef objectStream;
            if (CGPDFObjectGetValue(object, kCGPDFObjectTypeStream, &objectStream)) {

                CGPDFDictionaryRef dict = CGPDFStreamGetDictionary( objectStream );

                CGPDFDataFormat fmt = CGPDFDataFormatRaw;
                // CGPDFStreamCopyData returns a +1 reference (or NULL); hand ownership to ARC so it is released
                NSData *data = CFBridgingRelease(CGPDFStreamCopyData(objectStream, &fmt));
                [data writeToFile:[FileHelpers pathInDocumentDirectory:@"data.dat"] atomically:YES];
                NSString *dataString = data ? [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] : nil;
                //if (!dataString) {
                //    dataString = [[NSString alloc] initWithData:data encoding:NSUTF16StringEncoding];
                // }

                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf stream length: %lu - %@", (unsigned long)data.length, dataString]];

                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];

                NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
                if( dict && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
                {
                    [selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
                    CGPDFDictionaryApplyFunction(dict, getDictionaryObjects, dict);
                    [selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
                }
            }
        }
    }
}

- (NSArray *)copyPDFArray:(CGPDFArrayRef)arr referencingDictionary:(CGPDFDictionaryRef)dict referencingKey:(const char *)key
{
    int i = 0;
    NSMutableArray *temp = [[NSMutableArray alloc] init];

    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array count: %zu", CGPDFArrayGetCount(arr)]];
    for (int i = 0; i < [selfClass catalogLevel]; i++)
        logString = [NSString stringWithFormat:@"-%@", logString];
    [Log LogDebug:logString];

    for(i=0; i<CGPDFArrayGetCount(arr); i++){
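        CGPDFObjectRef object = NULL;
        // Skip the element if it cannot be read rather than inspecting an uninitialized pointer
        if (!CGPDFArrayGetObject(arr, i, &object))
            continue;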
        CGPDFObjectType type = CGPDFObjectGetType(object);
        switch(type){
            case kCGPDFObjectTypeNull: {
                NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array null(%d)", i]];
                for (int i = 0; i < [selfClass catalogLevel]; i++)
                    logString = [NSString stringWithFormat:@"-%@", logString];
                [Log LogDebug:logString];
                break;
            }
            case kCGPDFObjectTypeBoolean: {
                CGPDFBoolean objectBool;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeBoolean, &objectBool)) {
                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array boolean value(%d): %@", i, [NSNumber numberWithBool:objectBool]]];
                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];
                    [temp addObject:[NSNumber numberWithBool:objectBool]];
                }
                break;
            }
            case kCGPDFObjectTypeInteger: {
                CGPDFInteger objectInteger;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeInteger, &objectInteger)) {
                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array integer value(%d): %ld", i, (long int)objectInteger]];
                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];
                    [temp addObject:[NSNumber numberWithLong:objectInteger]]; // CGPDFInteger is a long
                }
                break;
            }
            case kCGPDFObjectTypeReal:
            {
                CGPDFReal objectReal;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeReal, &objectReal))
                {
                    // Log and store the real as a floating-point value; casting to an integer dropped the fractional part
                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array real(%d): %f", i, (double)objectReal]];
                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];
                    [temp addObject:[NSNumber numberWithDouble:objectReal]];
                }
                break;
            }
            case kCGPDFObjectTypeName:
            {
                const char *name;
                // Read the name from the array element itself; looking up `key` in the referencing dictionary finds the array, not this name
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeName, &name))
                {
                    NSString *dictName = [[NSString alloc] initWithCString:name encoding:NSUTF8StringEncoding];

                    if (dictName)
                    {
                        NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array name value(%d): %@", i, dictName]];
                        for (int i = 0; i < [selfClass catalogLevel]; i++)
                            logString = [NSString stringWithFormat:@"-%@", logString];
                        [Log LogDebug:logString];
                        // Keep the value in the returned array instead of overwriting the dictionary entry for `key`
                        [temp addObject:dictName];
                    }
                }
                break;
            }
            case kCGPDFObjectTypeString:
            {
                CGPDFStringRef objectString;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeString, &objectString))
                {
                    // CGPDFStringCopyTextString returns a +1 reference (or NULL); hand ownership to ARC
                    NSString *tempStr = CFBridgingRelease(CGPDFStringCopyTextString(objectString));
                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array string(%d): %@", i, tempStr]];
                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];
                    if (tempStr) // addObject: raises on nil
                        [temp addObject:tempStr];
                }
                break;
            }
            case kCGPDFObjectTypeArray :
            {
                CGPDFArrayRef objectArray;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeArray, &objectArray))
                {
                    NSArray *tempArr = [selfClass copyPDFArray:objectArray referencingDictionary:dict referencingKey:key];
                    [temp addObject:tempArr];
                }
                break;
            }
            case kCGPDFObjectTypeDictionary :
            {
                CGPDFDictionaryRef objectDict;
                NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeDictionary, &objectDict) && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
                {
                    [selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
                    CGPDFDictionaryApplyFunction( objectDict, getDictionaryObjects,  objectDict);
                    [selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
                }
                break;
            }
            case kCGPDFObjectTypeStream :
            {
                CGPDFStreamRef objectStream;
                if (CGPDFObjectGetValue(object, kCGPDFObjectTypeStream, &objectStream))
                {
                    CGPDFDictionaryRef streamDict = CGPDFStreamGetDictionary( objectStream );
                    CGPDFDataFormat fmt = CGPDFDataFormatRaw;
                    // CGPDFStreamCopyData returns a +1 reference (or NULL); hand ownership to ARC so it is released
                    NSData *streamData = CFBridgingRelease(CGPDFStreamCopyData(objectStream, &fmt));
                    NSString *dataString = streamData ? [[NSString alloc] initWithData:streamData encoding:NSUTF8StringEncoding] : nil;

                    NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array stream length: (%d): %lu - %@", i, (unsigned long)streamData.length, dataString]];

                    for (int i = 0; i < [selfClass catalogLevel]; i++)
                        logString = [NSString stringWithFormat:@"-%@", logString];
                    [Log LogDebug:logString];


                    NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
                    if( streamDict && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
                    {
                        [selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
                        CGPDFDictionaryApplyFunction( streamDict, getDictionaryObjects, streamDict );
                        [selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
                    }
                }
            }

        }
    }
    return temp;
}

@end
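
The listing above avoids infinite recursion by skipping the `Parent` and `P` keys by name. Other keys can also point back up the object tree, so a more general guard is to remember which dictionaries have already been visited. The following is a minimal sketch of that idea in plain C (which also compiles inside a `.m` file). `Node`, `VisitedSet`, `markVisited`, `countReachable`, and `demoCycleCount` are hypothetical names standing in for real PDF dictionaries and are not part of Core Graphics. Whether Core Graphics returns the same `CGPDFDictionaryRef` pointer each time an indirect object is resolved is an assumption that would need checking before relying on pointer identity there.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 4
#define MAX_VISITED 64

/* Hypothetical stand-in for a PDF dictionary: each link may point to a
   child or back to an ancestor, so the graph can contain cycles. */
typedef struct Node {
    const char *name;
    struct Node *links[MAX_LINKS];
    size_t linkCount;
} Node;

/* Fixed-size set of node addresses already seen during the walk. */
typedef struct {
    const Node *seen[MAX_VISITED];
    size_t count;
} VisitedSet;

/* Returns true if the node was newly recorded, false if it was already
   present (or the set is full, in which case the node is skipped). */
static bool markVisited(VisitedSet *set, const Node *node) {
    for (size_t i = 0; i < set->count; i++) {
        if (set->seen[i] == node) return false;
    }
    if (set->count == MAX_VISITED) return false;
    set->seen[set->count++] = node;
    return true;
}

/* Depth-first walk that returns the number of distinct nodes reached. */
static size_t countReachable(const Node *node, VisitedSet *set) {
    if (node == NULL || !markVisited(set, node)) return 0;
    size_t total = 1;
    for (size_t i = 0; i < node->linkCount; i++) {
        total += countReachable(node->links[i], set);
    }
    return total;
}

/* Builds a three-node graph with two back-references and walks it. */
static size_t demoCycleCount(void) {
    Node catalog = { "Catalog", { NULL }, 0 };
    Node acroForm = { "AcroForm", { NULL }, 0 };
    Node field = { "Field", { NULL }, 0 };
    catalog.links[catalog.linkCount++] = &acroForm;
    acroForm.links[acroForm.linkCount++] = &field;
    field.links[field.linkCount++] = &acroForm;  /* back-reference, like /Parent */
    field.links[field.linkCount++] = &catalog;   /* another route back up */
    VisitedSet set = { { NULL }, 0 };
    return countReachable(&catalog, &set);
}
```

Calling `demoCycleCount()` terminates even though the graph is cyclic, because each node is processed only the first time it is reached. A linear scan is used for the set only to keep the sketch short; a hash set would suit large documents better.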

Answer 1:

By "editable fields", do you mean the kind of form elements that can be filled in using Acrobat or Adobe Reader?

Those fields are not part of the actual page description. If you look at the PDF Specification, you'll find a description of "Interactive Forms" in section 12.7. It explains that the field dictionaries for a document are stored under an entry called "AcroForm" in the document catalog.

As far as I know, iOS does give you access to the document catalog, so you would have to find the "AcroForm" entry in that catalog dictionary and then descend into the field dictionary structure to collect the information you want. All fields in the document are stored hierarchically in this one place.
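
To make the descent described above concrete, here is a minimal sketch in plain C (usable from a `.m` file) of flattening a field hierarchy. It relies on three entries from the specification's field dictionaries: `/T` holds a partial field name, `/FT` holds the field type (`Tx`, `Btn`, `Ch`, or `Sig`) and is inherited when a child omits it, and `/Kids` lists child fields or widget annotations. A field's fully qualified name is its ancestors' partial names joined by periods. The `Field`, `FlatField`, and `FieldList` types and the `collectFields` and `demoCollect` functions are hypothetical and operate on an in-memory model, not on Core Graphics objects. With Core Graphics, the corresponding lookups would presumably be `CGPDFDictionaryGetArray(acroForm, "Fields", ...)`, `CGPDFArrayGetDictionary`, `CGPDFDictionaryGetString(field, "T", ...)`, and `CGPDFDictionaryGetName(field, "FT", ...)`.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 4
#define MAX_FIELDS 16
#define NAME_LEN 128

/* Hypothetical in-memory copy of a field dictionary: /T, /FT and /Kids. */
typedef struct Field {
    const char *partialName;            /* /T; NULL for a pure widget annotation */
    const char *fieldType;              /* /FT; NULL means "inherit from parent" */
    const struct Field *kids[MAX_KIDS]; /* /Kids */
    size_t kidCount;
} Field;

typedef struct {
    char name[NAME_LEN];                /* fully qualified name, e.g. "form.name.first" */
    const char *fieldType;              /* effective /FT after inheritance */
} FlatField;

typedef struct {
    FlatField items[MAX_FIELDS];
    size_t count;
} FieldList;

/* Recursively appends the terminal fields at or below `field` to `out`. */
static void collectFields(const Field *field, const char *prefix,
                          const char *inheritedType, FieldList *out) {
    /* Entries without /T are treated as widgets, not fields. */
    if (field == NULL || field->partialName == NULL) return;

    char fullName[NAME_LEN];
    if (prefix[0] != '\0')
        snprintf(fullName, sizeof fullName, "%s.%s", prefix, field->partialName);
    else
        snprintf(fullName, sizeof fullName, "%s", field->partialName);

    const char *type = field->fieldType ? field->fieldType : inheritedType;

    size_t before = out->count;
    for (size_t i = 0; i < field->kidCount; i++)
        collectFields(field->kids[i], fullName, type, out);

    /* No descendant field was added, so this node is a terminal field. */
    if (out->count == before && out->count < MAX_FIELDS) {
        FlatField *item = &out->items[out->count++];
        snprintf(item->name, sizeof item->name, "%s", fullName);
        item->fieldType = type;
    }
}

/* Builds a small sample tree and returns how many terminal fields it holds. */
static size_t demoCollect(FieldList *out) {
    static const Field first  = { "first", NULL,  { NULL }, 0 };
    static const Field last   = { "last",  NULL,  { NULL }, 0 };
    static const Field name   = { "name",  "Tx",  { &first, &last }, 2 };
    static const Field widget = { NULL,    NULL,  { NULL }, 0 };
    static const Field agree  = { "agree", "Btn", { &widget }, 1 };
    static const Field root   = { "form",  NULL,  { &name, &agree }, 2 };
    out->count = 0;
    collectFields(&root, "", NULL, out);
    return out->count;
}
```

For the sample tree, `demoCollect` yields three terminal fields: `form.name.first` and `form.name.last` (both inheriting type `Tx` from `name`) and `form.agree` (type `Btn`). A flat list like this is the kind of data a form view could be built from before sending values back to the server.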