How to extract page number from PDF file [closed]

2019-07-15 10:50发布

We explored so many API's like tika,Pdfbox and itextpdf to extract page number from pdf file but we did not able to do this. In itextpdf we got PdfPageLabels.getPageLabels(reader) but the behaviour of this method is not uniform.

1条回答
爱情/是我丢掉的垃圾
2楼-- · 2019-07-15 11:31

The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.

Allow me to predict your response.

*"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"

Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software can not give you the page number you see as a human. A machine doesn't know where to look!

If you know something about PDF, I can predict your next reply.

"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"

Well yes, when a PDF is tagged a snippet of text knows that is is part of a title, or a paragraph, or a list,... But Tagged PDF is there to define the structure of the real content. Page numbers however, are not part of the real content. They are marked as artifacts along with headers, footers and other items on a page that are not considered being real content. There is no way to distinguish page numbers.

"Then what are these page labels about?" you ask.

Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.

This is the long answer. The short answer is simple: You are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).

查看更多
登录 后发表回答