PDF Parsing with Text and Coordinates

2019-03-13 14:34发布

问题:

I am currently using PDF Box to parse a pdf and I am trying to figure out how to retrieve data about the text such as the font (bold, size, etc) and the location of the font.

Any suggestions?

回答1:

After poking around the (hard to find) PDFBox docs, I found this little gem.

Apparently one of the examples shows exactly how to do everything you asked. Basically, you subclass PdfTextStripper and override the processTextPosition method. There, you query the TextPosition for whatever information you need.

For future reference, you can find the javaDoc here: http://pdfbox.apache.org/apidocs/index.html

Edit 2018-04-02: original link is dead, but example can be found in the SVN repo here.



回答2:

One of the best things for text extraction from PDFs is TET, the text extraction toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's (the author of the "PostScript and PDF Bible") company.

TET's first incarnation is a library. That one can probably do everything you want, including to positional information about each text element on the page. Oh, and it can also extract images. It recombines+merges images which are fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. Obviously you'd need Acrobat as well to make use of this.

And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user workstations. Both this is free (as in beer) to use for private, non-commercial purposes.

Lastly, TET also comes with a commandline interface.

TET is really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) do spit out garbage only.

A few months ago I tested their desktop standalone tool, and what they say on their webpage is true. It has a very good commandline. Some of my "problematic" PDF test files the tool handled to my full satisfaction.

This thing is my recommendation for every sophisticated and challenging PDF text extraction requirements.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.



回答3:

The GetPageText function with extract option 3 or 4 in Quick PDF Library returns a CSV string for the selected page which includes the text (either individual words or a piece of text) and the related font name, text color, text size and co-ordinates on the page.

Note: it is a commercial library and I work for the company that sells it.



回答4:

PDF files can be parsed with tabula-py, or tabula-java.

I made a full tutorial on how to use tabula-py on this article. You can tabula in a web-browser too as long as you have installed Java.