All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?
Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this, just by counting the p elements.
When doing this with PDF you may run into the problem that the parser doesn't send text lines in the proper order; see "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" on how to handle this.
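A minimal sketch of a handler along these lines, using only the SAX classes that ship with the JDK. The class name PageCollectingHandler and the helper pagesOf are hypothetical, and the sketch assumes each page arrives wrapped in a div carrying class="page"; if your Tika version marks pages differently, adjust the check in startElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each page from XHTML SAX events. */
class PageCollectingHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while we are outside a page
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current != null) {
            depth++; // a nested div inside the page
        } else if ("page".equals(atts.getValue("class"))) {
            // Assumption: a page is opened by <div class="page">.
            current = new StringBuilder();
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The page ends when the div that opened it is closed.
        if (current != null && "div".equals(qName) && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** One entry per page, in document order. */
    List<String> getPages() {
        return pages;
    }

    /** Convenience for trying the handler on an XHTML string without Tika. */
    static List<String> pagesOf(String xhtml) {
        try {
            PageCollectingHandler handler = new PageCollectingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

With Tika, the same handler would be passed as the ContentHandler argument of parser.parse(stream, handler, metadata, context); getPages().size() then gives the page count and getPages().get(i) the text of page i + 1.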
You'll need to work with the underlying libraries - Tika doesn't do anything at the page level.
For PDF files, PDFBox should be able to give you page-level access. For Word, HWPF and XWPF from Apache POI don't really do page-level things: the page breaks aren't stored in the file, but instead need to be calculated on the fly based on the text, fonts, and page size.
You can get the number of pages in a PDF from the metadata object's xmpTPg:NPages key.
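A small sketch of reading that key defensively. PageCounts and pageCount are hypothetical names; the lookup is passed in as a function so the snippet runs without Tika on the classpath. With Tika you would call PageCounts.pageCount(metadata::get) after parsing, on the assumption that the parser actually populated the key for your file type.

```java
import java.util.OptionalInt;
import java.util.function.Function;

/** Hypothetical helper for reading the page count out of parsed metadata. */
class PageCounts {
    // Key named in the answer above.
    static final String PAGE_COUNT_KEY = "xmpTPg:NPages";

    /**
     * Looks up the page count via the given key-to-value function.
     * Returns an empty OptionalInt if the key is absent or not a number.
     */
    static OptionalInt pageCount(Function<String, String> lookup) {
        String raw = lookup.apply(PAGE_COUNT_KEY);
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

Returning an OptionalInt rather than a bare int keeps "no page count reported" distinct from a count of zero, which matters for formats where the key is never set.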