How to differentiate if NSData is xls, ppt or doc

Posted 2019-07-23 13:59

Question:

I'm working on a file-handling app, and I recently hit a bug caused by links that don't have a file extension, like this:

https://drive.google.com/uc?export=download&id=1234567abcdefghijk

Until now I have determined the file type from the filename at the end of the link, which works when the link points directly to the file.

A redirecting link like the Google Drive link above still returns the data, but because it has no file extension, the UIWebView doesn't render document files. (I use a different viewer for images, and it works fine because you can pass the data directly to a UIImage.)

The solution I came up with was to check the file signature, which you can find in the first 1024 bytes of the data. I found the file signatures for document types at http://www.filesignatures.net/index.php.

I can tell images and PDF files apart, but the problem is xls/ppt/doc and xlsx/pptx/docx: each group shares a single file signature, [D0 CF 11 E0 A1 B1 1A E1] and [50 4B 03 04] respectively.

What I want to know is whether there are other ways to differentiate those Microsoft Office document files.

This is the code I have so far. If you know how to improve this function, I would accept an answer that includes some explanation:

typedef enum FileSignature {
    kFileSignaturePDF,
    kFileSignaturePPT_DOC_XLS,
    kFileSignaturePPTX_DOCX_XLSX,
    kFileSignaturePNG,
    kFileSignatureJPG,
    kFileSignatureBMP,
    kFileSignatureUndefined,
}FileSignature;

+ (FileSignature)getDocumentTypeOfData:(NSData *)documentData {

    static const unsigned char pdfBytes[] = {0x25, 0x50, 0x44, 0x46};
    // FF D8 FF matches both JFIF (FF E0) and Exif (FF E1) JPEG files
    static const unsigned char jpgBytes[] = {0xFF, 0xD8, 0xFF};
    static const unsigned char pngBytes[] = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
    static const unsigned char bmpBytes[] = {0x42, 0x4D};
    // pptx,xlsx,docx
    static const unsigned char msOfficeXBytes[] = {0x50, 0x4B, 0x03, 0x04};
    // ppt,xls,doc
    static const unsigned char msOfficeBytes[] = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Signatures are compared as raw NSData: several contain bytes above 0x7F,
    // which NSASCIIStringEncoding cannot represent. sizeof() is applied to the
    // arrays themselves, so it yields the signature length in bytes.
    NSArray *signatures = @[
        [NSData dataWithBytes:pdfBytes length:sizeof(pdfBytes)],
        [NSData dataWithBytes:jpgBytes length:sizeof(jpgBytes)],
        [NSData dataWithBytes:pngBytes length:sizeof(pngBytes)],
        [NSData dataWithBytes:bmpBytes length:sizeof(bmpBytes)],
        [NSData dataWithBytes:msOfficeBytes length:sizeof(msOfficeBytes)],
        [NSData dataWithBytes:msOfficeXBytes length:sizeof(msOfficeXBytes)],
    ];

    // File types in the same order as the signatures array
    const FileSignature fileTypes[] = {
        kFileSignaturePDF, kFileSignatureJPG, kFileSignaturePNG,
        kFileSignatureBMP, kFileSignaturePPT_DOC_XLS, kFileSignaturePPTX_DOCX_XLSX,
    };

    for (NSUInteger i = 0; i < signatures.count; i++) {
        NSData *signature = signatures[i];

        // Skip signatures longer than the data so the search range stays in bounds
        if (documentData.length < signature.length) {
            continue;
        }

        // An anchored search only matches at the start of the range
        NSRange foundRange = [documentData rangeOfData:signature options:NSDataSearchAnchored range:NSMakeRange(0, signature.length)];

        if (foundRange.location != NSNotFound) {
            return fileTypes[i];
        }
    }

    return kFileSignatureUndefined;
}
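
For the xls/ppt/doc and xlsx/pptx/docx ambiguity, one option that stays with the raw bytes is to look past the shared signature. The sketch below is a heuristic under stated assumptions, not a full parser. It assumes that the main part of an OOXML package sits under the conventional names word/document.xml, xl/workbook.xml and ppt/presentation.xml (ZIP stores entry names uncompressed), and that legacy OLE2 files name their main stream WordDocument, Workbook or PowerPoint Document in UTF-16LE. The enum and function names are hypothetical. It is plain C, so it compiles unchanged inside an Objective-C .m file, and it needs the whole file rather than the first 1024 bytes because the names can appear anywhere in the data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    OFFICE_UNKNOWN,
    OFFICE_DOC, OFFICE_XLS, OFFICE_PPT,
    OFFICE_DOCX, OFFICE_XLSX, OFFICE_PPTX
} OfficeKind;

/* Returns 1 if the nlen bytes at needle occur anywhere in buf. */
static int contains(const unsigned char *buf, size_t len,
                    const void *needle, size_t nlen) {
    if (nlen == 0 || len < nlen) return 0;
    for (size_t i = 0; i + nlen <= len; i++) {
        if (memcmp(buf + i, needle, nlen) == 0) return 1;
    }
    return 0;
}

/* Returns 1 if the ASCII string occurs in buf as-is (ZIP entry names). */
static int contains_ascii(const unsigned char *buf, size_t len, const char *s) {
    return contains(buf, len, s, strlen(s));
}

/* Returns 1 if the ASCII name occurs in buf encoded as UTF-16LE,
   which is how OLE2 directory entries store stream names. */
static int contains_utf16le(const unsigned char *buf, size_t len, const char *name) {
    unsigned char wide[64];
    size_t n = strlen(name);
    if (n * 2 > sizeof wide) return 0;
    for (size_t i = 0; i < n; i++) {
        wide[2 * i] = (unsigned char)name[i];
        wide[2 * i + 1] = 0;
    }
    return contains(buf, len, wide, n * 2);
}

/* Heuristic: check the container signature, then scan for the
   assumed main part or stream name of each Office format. */
OfficeKind guess_office_kind(const unsigned char *buf, size_t len) {
    static const unsigned char zip_magic[] = {0x50, 0x4B, 0x03, 0x04};
    static const unsigned char ole_magic[] = {0xD0, 0xCF, 0x11, 0xE0,
                                              0xA1, 0xB1, 0x1A, 0xE1};

    if (len >= sizeof zip_magic && memcmp(buf, zip_magic, sizeof zip_magic) == 0) {
        if (contains_ascii(buf, len, "word/document.xml")) return OFFICE_DOCX;
        if (contains_ascii(buf, len, "xl/workbook.xml")) return OFFICE_XLSX;
        if (contains_ascii(buf, len, "ppt/presentation.xml")) return OFFICE_PPTX;
        return OFFICE_UNKNOWN; /* some other ZIP archive */
    }

    if (len >= sizeof ole_magic && memcmp(buf, ole_magic, sizeof ole_magic) == 0) {
        if (contains_utf16le(buf, len, "WordDocument")) return OFFICE_DOC;
        if (contains_utf16le(buf, len, "PowerPoint Document")) return OFFICE_PPT;
        if (contains_utf16le(buf, len, "Workbook")) return OFFICE_XLS;
        return OFFICE_UNKNOWN; /* some other OLE2 compound file */
    }

    return OFFICE_UNKNOWN;
}
```

From Objective-C it could be called as guess_office_kind(documentData.bytes, documentData.length). Files with embedded Office objects can be misclassified, and older Excel files whose stream is named Book are left unknown; a stricter check would walk the ZIP central directory or the OLE2 directory sectors instead of scanning for byte strings.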

Answer 1:

It took me a while to post this, but I went with trojanfoe's idea of reading the Content-Type from the response headers. If you are using AFNetworking 2.0, you can get the Content-Type in the success block from operation.response.allHeaderFields. allHeaderFields is a property of NSHTTPURLResponse, so it is also available to those using NSURLConnection directly.

If you can improve this, whether by optimizing it, shortening the code, or adding to the list of supported documents, I suggest you post an answer.

typedef enum DocumentType {
    kDocumentTypePDF,
    kDocumentTypePPT,
    kDocumentTypeDOC,
    kDocumentTypeXLS,
    kDocumentTypePPTX,
    kDocumentTypeDOCX,
    kDocumentTypeXLSX,
    kDocumentTypePNG,
    kDocumentTypeJPG,
    kDocumentTypeBMP,
    kDocumentTypeIMG,
    kDocumentTypeUndefined,
}DocumentType;

+ (DocumentType) getDocumentTypeBasedOnContentType:(NSString *)contentType {

    if ( [contentType isEqualToString:@"application/pdf"] ) {
        return kDocumentTypePDF;
    } else if ( [contentType isEqualToString:@"application/mspowerpoint"] ||
                [contentType isEqualToString:@"application/powerpoint"] ||
                [contentType isEqualToString:@"application/vnd.ms-powerpoint"] ||
                [contentType isEqualToString:@"application/x-mspowerpoint"]) {
        return kDocumentTypePPT;
    } else if ( [contentType isEqualToString:@"application/msword"] ) {
        return kDocumentTypeDOC;
    } else if ( [contentType isEqualToString:@"application/excel"] ||
                [contentType isEqualToString:@"application/vnd.ms-excel"] ||
                [contentType isEqualToString:@"application/x-excel"] ||
                [contentType isEqualToString:@"application/x-msexcel"] ) {
        return kDocumentTypeXLS;
    } else if ( [contentType isEqualToString:@"application/vnd.openxmlformats-officedocument.wordprocessingml.document"] ) {
        return kDocumentTypeDOCX;
    } else if ( [contentType isEqualToString:@"application/vnd.openxmlformats-officedocument.presentationml.presentation"] ) {
        return kDocumentTypePPTX;
    } else if ( [contentType isEqualToString:@"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"] ) {
        return kDocumentTypeXLSX;
    } else if ( [contentType rangeOfString:@"image/"].location != NSNotFound ) {
        return kDocumentTypeIMG;
    } else {
        return kDocumentTypeUndefined;
    }

}
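
One caveat with the exact string comparisons above: a Content-Type header value can carry parameters (for example "; charset=..."), and media type names are case-insensitive, so a raw header value may not match any branch. Below is a minimal sketch, in plain C that compiles inside an Objective-C .m file, of normalizing the value before comparing it. The function name normalize_content_type is hypothetical and not part of Foundation or AFNetworking.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies the media type from a Content-Type header value into out:
   lowercased, with parameters and surrounding whitespace removed.
   Returns 0 on success, or -1 if out is too small to hold the result. */
int normalize_content_type(const char *header, char *out, size_t out_size) {
    size_t n = 0;

    if (out_size == 0) return -1;

    /* Skip leading whitespace. */
    while (*header == ' ' || *header == '\t') header++;

    /* Copy up to the first ';', which starts the parameter list. */
    while (*header != '\0' && *header != ';') {
        if (n + 1 >= out_size) return -1;   /* keep room for the terminator */
        out[n++] = (char)tolower((unsigned char)*header++);
    }

    /* Drop trailing whitespace left before the ';' or the end. */
    while (n > 0 && (out[n - 1] == ' ' || out[n - 1] == '\t')) n--;

    out[n] = '\0';
    return 0;
}
```

The normalized C string could then be turned into an NSString with stringWithUTF8String: and passed to getDocumentTypeBasedOnContentType:. NSURLResponse also has a MIMEType property, which may be a simpler starting point than the raw header.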