We explored so many API's like tika,Pdfbox and itextpdf to extract page number from pdf file but we did not able to do this. In itextpdf we got PdfPageLabels.getPageLabels(reader) but the behaviour of this method is not uniform.
相关问题
- Delete Messages from a Topic in Apache Kafka
- Jackson Deserialization not calling deserialize on
- How to maintain order of key-value in DataFrame sa
- StackExchange API - Deserialize Date in JSON Respo
- Difference between Types.INTEGER and Types.NULL in
The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.
Allow me to predict your response.
*"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"
Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software can not give you the page number you see as a human. A machine doesn't know where to look!
If you know something about PDF, I can predict your next reply.
"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"
Well yes, when a PDF is tagged a snippet of text knows that is is part of a title, or a paragraph, or a list,... But Tagged PDF is there to define the structure of the real content. Page numbers however, are not part of the real content. They are marked as artifacts along with headers, footers and other items on a page that are not considered being real content. There is no way to distinguish page numbers.
"Then what are these page labels about?" you ask.
Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
This is the long answer. The short answer is simple: You are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).