I have multiple (30) PDF files, each containing 48-96 pages. The layout of all pages is identical; only the contents (numbers, graphs) differ.
Background: These pages are PDF reports of fibre cable measurements, and I have to sort them by the attenuation of the cables. For confidentiality reasons, I unfortunately cannot provide an example file.
To verify these reports, we take control samples, which is why I need the reports sorted. The question now is: How can I export only very specific parts of all pages in all PDF files to some format I can sort?
As already mentioned, the values are always located at the same positions on the page. The content is also already "parsed", i.e. it is available as text in the PDF file; it is not scanned, so no OCR is required.
Any help is appreciated. I currently have no idea how to solve this; it could be a tool that does something like this, or a programming approach.
As you indicate in your comments to the original question, you are prepared to program a solution. I would propose using Java and the iText PDF library. It enables you to extract text from documents as long as the text actually is extractable (a PDF can contain glyphs without the mappings from glyphs to characters, in which case the text cannot be recovered).
You can find sample code for PDF text extraction with iText in the ExtractPageContent* samples for chapter 15 of iText in Action — 2nd Edition. Especially ExtractPageContentArea is of interest in your case.
Essentially you only have to take that sample and generalize it to extract the text from multiple areas on the page.
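Once the text of each area has been extracted, what remains is turning it into something sortable. The following is only a minimal sketch of that last step in plain Java, without any iText calls. It assumes the extracted area text looks something like "Attenuation: 0,35 dB"; that format, the class name AttenuationSorter and the file names are hypothetical, so adapt the parsing to whatever your reports actually contain.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects one value per report page and sorts by it.
class AttenuationSorter {

    /** One report page: source file, page number, and the attenuation read from it. */
    static final class Measurement {
        final String file;
        final int page;
        final double attenuation;

        Measurement(String file, int page, double attenuation) {
            this.file = file;
            this.page = page;
            this.attenuation = attenuation;
        }
    }

    // First decimal number in the text; accepts both '.' and ',' as decimal separator.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:[.,]\\d+)?");

    /** Parses the first number found in the text extracted from one page area. */
    static double parseAttenuation(String areaText) {
        Matcher m = NUMBER.matcher(areaText);
        if (!m.find()) {
            throw new IllegalArgumentException("No number in: " + areaText);
        }
        return Double.parseDouble(m.group().replace(',', '.'));
    }

    /** Returns a copy of the input sorted by ascending attenuation. */
    static List<Measurement> sortByAttenuation(List<Measurement> input) {
        List<Measurement> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingDouble((Measurement m) -> m.attenuation));
        return sorted;
    }

    /** Renders the rows as CSV text (file names are assumed to contain no commas). */
    static String toCsv(List<Measurement> rows) {
        StringBuilder sb = new StringBuilder("file,page,attenuation\n");
        for (Measurement r : rows) {
            sb.append(r.file).append(',').append(r.page).append(',')
              .append(String.format(Locale.ROOT, "%.3f", r.attenuation)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a real run these strings would come from the per-area text extraction.
        List<Measurement> rows = new ArrayList<>();
        rows.add(new Measurement("report01.pdf", 1, parseAttenuation("Attenuation: 0,35 dB")));
        rows.add(new Measurement("report01.pdf", 2, parseAttenuation("Attenuation: 0.21 dB")));
        rows.add(new Measurement("report02.pdf", 1, parseAttenuation("Attenuation: 0.48 dB")));
        System.out.print(toCsv(sortByAttenuation(rows)));
    }
}
```

The CSV output can be opened in any spreadsheet program, and keeping the file name and page number in each row lets you find the original report page again when picking control samples.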