Extract specific parts of PDF documents [closed]

Posted 2019-05-11 06:31

I have multiple (30) PDF files, each containing 48-96 pages. The layout of all pages is identical; only the contents (numbers, graphs) differ.

Background: These pages are PDF reports of fibre cable measurements, and I have to sort them by the attenuation of the cables. For confidentiality reasons, I unfortunately cannot give an example file.

To verify these reports, we take some control samples, which is why I need the reports sorted. The question now is: how can I export only very specific parts of all pages in all PDF files to some format I can sort?

As already mentioned, the values are always located at the same, very specific positions on the page. The content is also already "parsed", i.e. it is available as text in the PDF file; it is not scanned, so no OCR is required.

Any help is appreciated. I currently have no idea how to solve this; it could be an existing tool that does something like that, or a programming approach.

Tags: excel pdf converter

1 answer

冷血范

#2 · 2019-05-11 07:00

As you indicate in your comments to the original question, you are prepared to program a solution. I would propose using Java and the iText PDF library. It enables you to extract text from documents as long as the text actually is extractable (a PDF can contain glyphs without the mappings from glyphs to characters, in which case the text cannot be recovered).

You can find sample code for PDF text extraction with iText in the ExtractPageContent* samples for chapter 15 of iText in Action — 2nd Edition. Especially ExtractPageContentArea is of interest in your case.

Essentially you only have to take that sample and generalize it to extract the text from multiple areas on the page.
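
Once the text of each area has been extracted, the remaining step from the question is getting the values into a format that can be sorted. The following is only a minimal sketch of that post-processing step, using plain Java without iText. It assumes the extracted area text contains the attenuation as a decimal number followed by "dB" (for example "0,42 dB"); that format, the class name AttenuationSorter, and the CSV column names are illustrative assumptions, not something taken from the actual reports.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns text extracted from a page area into sortable CSV rows.
class AttenuationSorter {

    /** One extracted value: source file, page number, attenuation in dB. */
    static final class Measurement {
        final String file;
        final int page;
        final double attenuationDb;

        Measurement(String file, int page, double attenuationDb) {
            this.file = file;
            this.page = page;
            this.attenuationDb = attenuationDb;
        }
    }

    // Assumed value format: decimal number (dot or comma separator) followed by "dB".
    private static final Pattern ATTENUATION =
            Pattern.compile("(-?\\d+(?:[.,]\\d+)?)\\s*dB");

    /** Returns the first attenuation value found in the text, or null if there is none. */
    static Double parseAttenuation(String areaText) {
        Matcher m = ATTENUATION.matcher(areaText);
        if (!m.find()) {
            return null;
        }
        return Double.parseDouble(m.group(1).replace(',', '.'));
    }

    /** Sorts by ascending attenuation and renders CSV (file names must not contain commas). */
    static String toSortedCsv(List<Measurement> measurements) {
        List<Measurement> sorted = new ArrayList<>(measurements);
        sorted.sort(Comparator.comparingDouble((Measurement m) -> m.attenuationDb));
        StringBuilder csv = new StringBuilder("file,page,attenuation_db\n");
        for (Measurement m : sorted) {
            csv.append(String.format(Locale.ROOT, "%s,%d,%.3f\n",
                    m.file, m.page, m.attenuationDb));
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for text extracted from the page areas.
        String[][] extracted = {
            {"report01.pdf", "1", "Attenuation: 0,42 dB"},
            {"report01.pdf", "2", "Attenuation: 0.21 dB"},
        };
        List<Measurement> measurements = new ArrayList<>();
        for (String[] row : extracted) {
            Double value = parseAttenuation(row[2]);
            if (value != null) {
                measurements.add(new Measurement(row[0], Integer.parseInt(row[1]), value));
            }
        }
        System.out.print(toSortedCsv(measurements));
    }
}
```

The resulting CSV can be opened directly in Excel. Pages where no value is found are skipped here; for control samples it may be preferable to log them instead, so that a page with a shifted layout does not go unnoticed.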
