PDF table extraction

Posted 2019-02-08 17:53

I have the same data saved both as a GIF image and as a PDF file, and I want to parse it into HTML or XML. The data is the menu of my university's cafeteria, which means a new version of the file has to be parsed each week! In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have already made a few attempts at parsing out the table data as HTML/XML:

PDF

  • PDFBox || iText (Java)
  • Google Docs Import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I have got the best results from parsing the PDF file with PDFBox, but since the menu changes weekly, it is still not reliable enough. The HTML I receive sometimes contains more, sometimes fewer paragraphs (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know whether there is another way to do it.
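Since the number of paragraphs in the extracted HTML varies from week to week, one workaround is to collect the text of every <p> element and then filter by content (e.g. a weekday prefix) rather than by position. A minimal standard-library sketch; the sample HTML and the filter keywords are illustrative, the real header/footer strings depend on the actual menu PDF:

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, no matter how many there are."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Illustrative input resembling PDFBox/pdftohtml output.
html = "<p>Cafeteria menu</p><p>Monday: soup</p><p>Tuesday: pasta</p><p>Page 1</p>"
parser = ParagraphCollector()
parser.feed(html)

# Keep only lines that look like menu entries (weekday prefix), so extra
# or missing paragraphs around them do not matter.
days = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
menu = [p.strip() for p in parser.paragraphs if p.strip().startswith(days)]
```

This sidesteps the varying paragraph count by keying on the menu text itself instead of the document structure.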

7 Answers
欢心 · 2019-02-08 18:35

I have tried many OCR and text-converter programs, though I believe one should write the PDF-to-text conversion program oneself, since the image is understood best by the person performing the task.

I have also tried Google and many other online (about 900 websites) and offline (about 1000 programs) products from different companies. Whether you extract the text via OCR or directly from the PDF, the most accurate program I found is PDFTOHTML. Its accuracy rate is about 98%, while Google's online tool reaches about 94%. It is a very good piece of software that also preserves the formatting of the text, i.e. bold, italic, etc.
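For a weekly job, pdftohtml (from the Poppler tools) can be driven from a small script. A hedged sketch that builds the command and only runs it when the tool and input are actually present; "menu.pdf" and "menu" are placeholder names:

```python
import os
import shutil
import subprocess

def pdftohtml_cmd(pdf_path, out_base):
    # -c: "complex" mode, which preserves layout and formatting (bold, italic);
    # -s: generate a single HTML document covering all pages;
    # -i: ignore embedded images.
    return ["pdftohtml", "-c", "-s", "-i", pdf_path, out_base]

cmd = pdftohtml_cmd("menu.pdf", "menu")
# Only invoke the converter when it is installed and the input file exists.
if shutil.which("pdftohtml") and os.path.exists("menu.pdf"):
    subprocess.run(cmd, check=True)
```

Wrapping the call this way makes the weekly run scriptable (e.g. from cron) instead of a manual conversion.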

何必那么认真 · 2019-02-08 18:39

I have implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.

Here are some sample PDF files and results:

  1. Input file: sample-1.pdf, result: sample-1.html
  2. Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange, or my article at traprange.

时光不老，我们不散 · 2019-02-08 18:43

Are the tables in the same place each time? If you can find the dimensions of each box, you could use a tool to split the PDF into multiple documents, each containing one box, after which you can use whatever tool you like to convert each smaller PDF to HTML (such as the tools mentioned in the other answers). A quick Google search turned up PyPdf, which looks like it might have some useful functions.

If you aren't able to hard-code the size of the box (or want to apply this approach to multiple menus in different formats), the obvious method to me (I said obvious, not easy) would be edge detection to find the borders of the table, followed by the splitting described above.

Lonely孤独者° · 2019-02-08 18:50

If you are looking to extract data from tables once a week and you are on Windows, please check out this freeware PDF utility, which includes automated table detection and table-to-CSV/XML conversion: PDF Viewer utility.

The utility is free for both commercial and non-commercial use by non-developers (and there is a separate version for developers who want to automate it via the API).

Disclaimer: I work for ByteScout

仙女界的扛把子 · 2019-02-08 18:54

You can use Camelot to extract tables from your PDF and export them to an HTML file; CSV, Excel and JSON are also supported. You can check out the documentation at http://camelot-py.readthedocs.io. It gives more accurate results than other open-source table-extraction tools and libraries; here's a comparison.

You can use the following code snippet to go forward with your task:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_html('file.html')

Disclaimer: I'm the author of the library.

冷血范 · 2019-02-08 18:57

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
