PDF table extraction

Posted 2019-02-08 17:53

I have the same data saved both as a GIF image and as a PDF file, and I want to parse it into HTML or XML. The data is the menu of my university's cafeteria, which means a new version of the file has to be parsed each week! In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have already made a few attempts at parsing out the table data as HTML/XML:

PDF

  • PDFBox || iText (Java)
  • Google Docs Import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I have got the best results from parsing the PDF file with PDFBox, but since the menu changes weekly, it is still not reliable enough. The HTML I receive sometimes contains more, sometimes fewer paragraphs (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know whether there is another way to do it.
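Since the number of paragraphs in the extracted HTML varies from week to week, one workaround is to collect the text of every <p> element and then filter by content (e.g. a weekday prefix) rather than by position. A minimal standard-library sketch; the sample HTML and the filter keywords are illustrative, the real header/footer strings depend on the actual menu PDF:

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, no matter how many there are."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Illustrative input resembling PDFBox/pdftohtml output.
html = "<p>Cafeteria menu</p><p>Monday: soup</p><p>Tuesday: pasta</p><p>Page 1</p>"
parser = ParagraphCollector()
parser.feed(html)

# Keep only lines that look like menu entries (weekday prefix), so extra
# or missing paragraphs around them do not matter.
days = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
menu = [p.strip() for p in parser.paragraphs if p.strip().startswith(days)]
```

This sidesteps the varying paragraph count by keying on the menu text itself instead of the document structure.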

7 Answers
欢心 · 2019-02-08 18:35

I have tried many OCR and text-converter programs, though I believe one should write the PDF-to-text conversion program oneself, since the image is understood best by the person performing the task.

I have also tried Google and many other online (about 900 websites) and offline (about 1000 programs) products from different companies. Whether you extract the text via OCR or directly from the PDF, the most accurate program I found is PDFTOHTML. Its accuracy rate is about 98%, while Google's online tool reaches about 94%. It is a very good piece of software that also preserves the formatting of the text, i.e. bold, italic, etc.
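For a weekly job, pdftohtml (from the Poppler tools) can be driven from a small script. A hedged sketch that builds the command and only runs it when the tool and input are actually present; "menu.pdf" and "menu" are placeholder names:

```python
import os
import shutil
import subprocess

def pdftohtml_cmd(pdf_path, out_base):
    # -c: "complex" mode, which preserves layout and formatting (bold, italic);
    # -s: generate a single HTML document covering all pages;
    # -i: ignore embedded images.
    return ["pdftohtml", "-c", "-s", "-i", pdf_path, out_base]

cmd = pdftohtml_cmd("menu.pdf", "menu")
# Only invoke the converter when it is installed and the input file exists.
if shutil.which("pdftohtml") and os.path.exists("menu.pdf"):
    subprocess.run(cmd, check=True)
```

Wrapping the call this way makes the weekly run scriptable (e.g. from cron) instead of a manual conversion.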

何必那么认真 · 2019-02-08 18:39

I have implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.

Here are some sample PDF files and results:

  1. Input file: sample-1.pdf, result: sample-1.html
  2. Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange, or my article at traprange.

时光不老，我们不散 · 2019-02-08 18:43

Are the tables in the same place each time? If you can find the dimensions of each box, you could use a tool to split the PDF into multiple documents, each containing one box, after which you can use whatever tool you like to convert each smaller PDF to HTML (such as the tools mentioned in the other answers). A quick Google search turned up PyPdf, which looks like it might have some useful functions.

If you aren't able to hard-code the size of the box (or want to apply this approach to multiple menus in different formats), the obvious method to me (I said obvious, not easy) would be edge detection to find the borders of the table, followed by the splitting described above.

Lonely孤独者° · 2019-02-08 18:50

If you are looking to extract data from tables once a week and you are on Windows, please check out this freeware PDF utility, which includes automated table detection and table-to-CSV/XML conversion: PDF Viewer utility.

The utility is free for both commercial and non-commercial use by non-developers (and there is a separate version for developers who want to automate it via the API).

Disclaimer: I work for ByteScout

仙女界的扛把子 · 2019-02-08 18:54

You can use Camelot to extract tables from your PDF and export them to an HTML file; CSV, Excel and JSON are also supported. You can check out the documentation at http://camelot-py.readthedocs.io. It gives more accurate results than other open-source table-extraction tools and libraries; here's a comparison.

You can use the following code snippet to go forward with your task:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_html('file.html')

Disclaimer: I'm the author of the library.

冷血范 · 2019-02-08 18:57

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
