Extract folders from a portfolio PDF in Java

Posted 2020-03-26 12:03

Question:

I have a portfolio PDF with folders, subfolders, and files. I need to extract it with the same structure of folders, subfolders, and files, using iText in Java. With EMBEDDEDFILES I only get the files. How can I fetch the folders as well?

Here is the code I am using. It only gives me the files inside the folders.

public static void extractAttachments(String src, String dir) throws IOException
{
    File folder = new File(dir);
    folder.mkdirs();

    PdfReader reader = new PdfReader(src);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println(""+names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println(""+embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    // The Names array alternates between key strings and file specification dictionaries.
    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folder, filespecs.getAsString(i++),
                filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, File dir, PdfString name, PdfDictionary filespec)
        throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfArray parent;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);
    //System.out.println(""+refs.getKeys().toString());

    for (Object key : refs.getKeys())
    {
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject((PdfName) key));

        filename = filespec.getAsString((PdfName) key).toString();

        // System.out.println("" + filename);
        fos = new FileOutputStream(new File(dir, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}

Answer 1:

The folder structure the OP is trying to replicate when extracting portfolio files is specified in the Adobe® Supplement to ISO 32000, BaseVersion: 1.7, ExtensionLevel: 3. It is therefore not part of the current PDF standard, and PDF processing software is not required to understand this kind of information. It appears to be scheduled for inclusion in the upcoming PDF 2.0 (ISO 32000-2) standard, though.

To extract portfolio files into the associated folder structure, therefore, we have to retrieve the folder information as specified in the Adobe® Supplement:

Beginning with extension level 3, a portable collection can contain a Folders object for the purpose of organizing files into a hierarchical structure. The structure is represented by a tree with a single root folder acting as the common ancestor for all other folders and files in the collection. The single root folder is referenced in the Folders entry of Table 8.6 on page 29.

Table 8.6c describes the entries in a folder dictionary

  • ID integer (Required; ExtensionLevel 3) A non-negative integer value representing the unique folder identification number. Two folders shall not share the same ID value.

    The folder ID value appears as part of the name tree key of any file associated with this folder. A detailed description of the association between folder and files can be found after this table.

  • Name text string (Required; ExtensionLevel 3) A file name representing the name of the folder. Two sibling folders shall not share the same name following case normalization.

  • Child dictionary (Required if the folder has any descendents; ExtensionLevel 3) An indirect reference to the first child folder of this folder.

  • Next dictionary (Required for all but the last item at each level; ExtensionLevel 3) An indirect reference to the next sibling folder at this level.

(section 8.2.4 Collections)

For example, like this:

static Map<Integer, File> retrieveFolders(PdfReader reader, File baseDir) throws DocumentException
{
    Map<Integer, File> result = new HashMap<Integer, File>();

    PdfDictionary root = reader.getCatalog();
    PdfDictionary collection = root.getAsDict(PdfName.COLLECTION);
    if (collection == null)
        throw new DocumentException("Document has no Collection dictionary");
    PdfDictionary folders = collection.getAsDict(FOLDERS);
    if (folders == null)
        throw new DocumentException("Document collection has no folders dictionary");

    collectFolders(result, folders, baseDir);

    return result;
}

static void collectFolders(Map<Integer, File> collection, PdfDictionary folder, File baseDir)
{
    PdfString name = folder.getAsString(PdfName.NAME);
    // Name is a PDF text string, so decode it the same way the file name keys are decoded.
    File folderDir = new File(baseDir, name.toUnicodeString());
    folderDir.mkdirs();
    PdfNumber id = folder.getAsNumber(PdfName.ID);
    collection.put(id.intValue(), folderDir);

    PdfDictionary next = folder.getAsDict(PdfName.NEXT);
    if (next != null)
        collectFolders(collection, next, baseDir);
    PdfDictionary child = folder.getAsDict(CHILD);
    if (child != null)
        collectFolders(collection, child, folderDir);
}

final static PdfName FOLDERS = new PdfName("Folders");
final static PdfName CHILD = new PdfName("Child");

(excerpt from PortfolioFileExtraction.java)

Then use the retrieved folder information when writing the files.
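The Child/Next linkage from Table 8.6c is a first-child/next-sibling tree. The following library-independent sketch shows the traversal on its own, without iText. `FolderNode` and `FolderPaths` are hypothetical names used only for illustration and stand in for the folder dictionaries. Paths are kept as plain strings rather than directories on disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a folder dictionary; the fields mirror Table 8.6c. */
final class FolderNode {
    final int id;
    final String name;
    FolderNode child; // first child folder, or null
    FolderNode next;  // next sibling folder, or null

    FolderNode(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class FolderPaths {
    /** Maps every folder ID reachable from 'first' to a slash-separated path. */
    static Map<Integer, String> collect(FolderNode first, String parentPath) {
        Map<Integer, String> paths = new LinkedHashMap<>();
        collect(first, parentPath, paths);
        return paths;
    }

    private static void collect(FolderNode first, String parentPath, Map<Integer, String> paths) {
        // Walk the sibling chain in a loop; recurse only when descending one level.
        for (FolderNode node = first; node != null; node = node.next) {
            String path = parentPath.isEmpty() ? node.name : parentPath + "/" + node.name;
            paths.put(node.id, path);
            collect(node.child, path, paths);
        }
    }

    public static void main(String[] args) {
        FolderNode root = new FolderNode(0, "Root");
        FolderNode f1 = new FolderNode(1, "Folder 1");
        FolderNode f2 = new FolderNode(2, "Folder 2");
        FolderNode f11 = new FolderNode(3, "Folder 11");
        root.child = f1;
        f1.next = f2;
        f1.child = f11;
        System.out.println(collect(root, ""));
    }
}
```

Siblings share the parent's path while children extend it, which is the same distinction collectFolders makes between baseDir and folderDir. This sketch assumes a well-formed tree; a malformed file with cyclic Child or Next references would need an extra guard, such as a set of already visited IDs.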

The association of files and folders is specified in the Adobe® Supplement like this:

As previously mentioned, files in the EmbeddedFiles name tree are associated with folders by a special naming convention applied to the name tree key strings. Strings that conform to the following rules serve to associate the corresponding file with a folder:

  • The name tree keys are PDF text strings.
  • The first character, excluding any byte order marker, is U+003C, the LESS-THAN SIGN (<).
  • The following characters shall be one or more digits (0 to 9) followed by the closing U+003E, the GREATER-THAN SIGN (>)
  • The remainder of the string is a file name.

The section of the string enclosed by LESS-THAN SIGN GREATER-THAN SIGN(<>) is interpreted as a numeric value that specifies the ID value of the folder with which the file is associated. The value shall correspond to a folder ID. The section of the string following the folder ID tag represents the file name of the embedded file.

Files in the EmbeddedFiles name tree that do not conform to these rules shall be treated as associated with the root folder.

(section 8.2.4 Collections)
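The naming rules quoted above can be checked in isolation before touching any PDF objects. Here is a minimal sketch, assuming the key has already been decoded to a Java string. `FolderKey` is a hypothetical helper, not part of iText. A null folder ID is used here to mean "root folder".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of iText): splits an EmbeddedFiles name tree key. */
final class FolderKey {
    // "<", one or more ASCII digits, ">", then the file name.
    private static final Pattern TAGGED = Pattern.compile("<([0-9]+)>(.*)", Pattern.DOTALL);

    final Integer folderId; // null means: associate the file with the root folder
    final String fileName;

    private FolderKey(Integer folderId, String fileName) {
        this.folderId = folderId;
        this.fileName = fileName;
    }

    static FolderKey parse(String key) {
        // The rules exclude a leading byte order marker, so drop it if present.
        if (key.startsWith("\uFEFF")) {
            key = key.substring(1);
        }
        Matcher m = TAGGED.matcher(key);
        if (m.matches()) {
            try {
                return new FolderKey(Integer.valueOf(m.group(1)), m.group(2));
            } catch (NumberFormatException tooLarge) {
                // The digits do not fit into an int; fall through to the root folder case.
            }
        }
        // Keys that do not conform to the rules belong to the root folder.
        return new FolderKey(null, key);
    }

    public static void main(String[] args) {
        FolderKey tagged = parse("<12>Test.pdf");
        System.out.println(tagged.folderId + " / " + tagged.fileName); // prints "12 / Test.pdf"
        FolderKey plain = parse("ThumbImpression.pdf");
        System.out.println(plain.folderId + " / " + plain.fileName);   // prints "null / ThumbImpression.pdf"
    }
}
```

Matching the whole key against a pattern rejects tags such as "<x>" or "<>" up front, so they fall back to the root folder instead of raising a NumberFormatException.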

Your methods can be extended to do so like this:

public static void extractAttachmentsWithFolders(PdfReader reader, String dir) throws IOException, DocumentException
{
    File folder = new File(dir);
    folder.mkdirs();

    Map<Integer, File> folders = retrieveFolders(reader, folder);

    PdfDictionary root = reader.getCatalog();

    PdfDictionary names = root.getAsDict(PdfName.NAMES);
    System.out.println("" + names.getKeys().toString());
    PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
    System.out.println("" + embedded.toString());

    PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);

    for (int i = 0; i < filespecs.size();)
    {
        extractAttachment(reader, folders, folder, filespecs.getAsString(i++), filespecs.getAsDict(i++));
    }
}

protected static void extractAttachment(PdfReader reader, Map<Integer, File> dirs, File dir, PdfString name, PdfDictionary filespec) throws IOException
{
    PRStream stream;
    FileOutputStream fos;
    String filename;
    PdfDictionary refs = filespec.getAsDict(PdfName.EF);

    File dirHere = dir;
    String nameString = name.toUnicodeString();
    if (nameString.startsWith("<"))
    {
        int closing = nameString.indexOf('>');
        if (closing > 0)
        {
            try
            {
                int folderId = Integer.parseInt(nameString.substring(1, closing));
                File folderFile = dirs.get(folderId);
                if (folderFile != null)
                    dirHere = folderFile;
            }
            catch (NumberFormatException e)
            {
                // Not a valid folder ID tag: the file stays in the root folder.
            }
        }
    }

    for (PdfName key : refs.getKeys())
    {
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));

        filename = filespec.getAsString(key).toString();

        fos = new FileOutputStream(new File(dirHere, filename));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
    }
}

(excerpt from PortfolioFileExtraction.java)

Applying these methods to your sample PDF (e.g. using the test method testSamplePortfolio11Folders in PortfolioFileExtraction.java), one gets the following directory structure:

Root
│   ThumbImpression.pdf
│
├───Folder 1
│   │   EStampPdf.pdf
│   │   Presentation.pdf
│   │
│   ├───Folder 11
│   │   │   Test.pdf
│   │   │
│   │   └───Folder 111
│   └───Folder 12
└───Folder 2
        SealDeed.pdf