Question:
Page numbers in a PDF come in different variations: some PDFs number their initial pages with Roman numerals such as i, ii, and the later pages with 1, 2, .... I found a function in PDFBox to get a desired page, page.get(pagenumber). The problem with this function is that when I write get(1), it returns a page by its position in the document (which may be numbered ii), not the page with page number 2. Is there any way to obtain the page whose page number in the PDF is, say, 2, and not the second page overall?
Answer 1:
Section 12.4.2 Page Labels in the PDF specification ISO 32000-1:2008 explains how the page labels (the special page numbers you want to understand) are defined in a document:
Each page in a PDF document shall be identified by an integer page index that expresses the page’s relative position within the document. In addition, a document may optionally define page labels (PDF 1.3) to identify each page visually on the screen or in print. Page labels and page indices need not coincide: the indices shall be fixed, running consecutively through the document starting from 0 for the first page, but the labels may be specified in any way that is appropriate for the particular document.
For purposes of page labelling, a document shall be divided into labelling ranges, each of which is a series of consecutive pages using the same numbering system. Pages within a range shall be numbered sequentially in ascending order. A page’s label consists of a numeric portion based on its position within its labelling range, optionally preceded by a label prefix denoting the range itself.
A document’s labelling ranges shall be defined by the PageLabels entry in the document catalogue (see 7.7.2, “Document Catalog”). The value of this entry shall be a number tree (7.9.7, “Number Trees”), each of whose keys is the page index of the first page in a labelling range. The corresponding value shall be a page label dictionary defining the labelling characteristics for the pages in that range. The tree shall include a value for page index 0. Table 159 shows the contents of a page label dictionary.
For more details and an example, see the specification itself.
Using low-level PDFBox methods, it should be easy to extract the PageLabels entry from the document catalogue and retrieve the labelling details.
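To make the quoted rules concrete, here is a minimal sketch in plain Java of how a label could be computed from labelling ranges, and how a label such as "2" could be mapped back to a page index. It does not use PDFBox. The classes LabelRange and PageLabelSketch and their methods are hypothetical names chosen for illustration. The fields mirror the S (style), P (prefix) and St (start value) entries of the page label dictionary in Table 159, and the TreeMap stands in for the number tree keyed by the first page index of each range.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one page label dictionary (not a PDFBox class).
final class LabelRange {
    final char style;    // 'D' decimal, 'R'/'r' Roman, 'A'/'a' letters, anything else: no numeric portion
    final String prefix; // label prefix, "" if none
    final int start;     // numeric value of the first page in the range

    LabelRange(char style, String prefix, int start) {
        this.style = style;
        this.prefix = prefix;
        this.start = start;
    }
}

final class PageLabelSketch {

    // Label of the page at the given 0-based index.
    // Assumes the map contains key 0, as the specification requires.
    static String label(TreeMap<Integer, LabelRange> ranges, int pageIndex) {
        Map.Entry<Integer, LabelRange> entry = ranges.floorEntry(pageIndex);
        LabelRange range = entry.getValue();
        int number = range.start + (pageIndex - entry.getKey());
        return range.prefix + numeric(range.style, number);
    }

    // 0-based index of the first page carrying the wanted label, or -1 if none does.
    static int indexOf(TreeMap<Integer, LabelRange> ranges, int pageCount, String wanted) {
        for (int i = 0; i < pageCount; i++) {
            if (label(ranges, i).equals(wanted)) {
                return i;
            }
        }
        return -1;
    }

    static String numeric(char style, int n) {
        switch (style) {
            case 'D': return Integer.toString(n);
            case 'R': return roman(n);
            case 'r': return roman(n).toLowerCase(Locale.ROOT);
            case 'A': return letters(n);
            case 'a': return letters(n).toLowerCase(Locale.ROOT);
            default:  return "";
        }
    }

    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // A to Z for 1..26, AA to ZZ for 27..52, and so on.
    static String letters(int n) {
        int repeat = (n - 1) / 26 + 1;
        char c = (char) ('A' + (n - 1) % 26);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

For a document whose first four pages use lowercase Roman numerals and whose remaining pages use decimal numbers restarting at 1, the ranges would be {0: ('r', "", 1), 4: ('D', "", 1)}. With those ranges, label(ranges, 1) gives "ii", and indexOf(ranges, 7, "2") gives 5, the 0-based index of the page labelled 2.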
Answer 2:
Although the title mentions PDFBox, you also tagged the question with itext, so let me show you how to extract PageLabels using iText:
PdfReader reader = new PdfReader(src);
String[] labels = PdfPageLabels.getPageLabels(reader);
Now you have a String array that could contain, for example:
labels[0] = "i";
labels[1] = "ii";
labels[2] = "iii";
labels[3] = "iv";
labels[4] = "1";
labels[5] = "2";
labels[6] = "3";
and so on...
Now you can put these values in a HashMap together with index + 1 as the page number if you want to know which physical page corresponds with the page label "2".
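The lookup table described above could be built roughly like this. This is a sketch in plain Java that takes the String array as input, so it does not depend on iText. PageLabelLookup and byLabel are hypothetical names, not part of any library. The null check is there because a document without page labels may not yield an array at all.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of iText or PDFBox.
final class PageLabelLookup {

    // Maps each page label to its 1-based physical page number.
    // labels[i] is assumed to be the label of the page at index i.
    static Map<String, Integer> byLabel(String[] labels) {
        Map<String, Integer> pageByLabel = new HashMap<>();
        if (labels == null) {
            return pageByLabel; // no page labels available
        }
        for (int i = 0; i < labels.length; i++) {
            // Labels need not be unique; keep the first page that carries a label.
            pageByLabel.putIfAbsent(labels[i], i + 1);
        }
        return pageByLabel;
    }

    public static void main(String[] args) {
        String[] labels = {"i", "ii", "iii", "iv", "1", "2", "3"};
        Map<String, Integer> pageByLabel = byLabel(labels);
        System.out.println(pageByLabel.get("2")); // prints 6
    }
}
```

With the example array above, the label "2" maps to physical page 6, while the second page overall carries the label "ii".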