How to scrape tables in thousands of PDF files?

Posted 2019-02-03 08:43

I have about 1,500 single-page PDFs that all share the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example).

What I am looking for is a way to iterate over all these files (locally, if possible) and extract the actual contents of the table (as CSV, stored into a SQLite DB, whatever).

I would love to do this in Node.js, but couldn't find any suitable libraries for parsing such stuff. Do you know of any?

If not possible in Node.js, I could also code it in Python, if there are better methods available.

Tags: python node.js parsing pdf scraper

1 answer

放荡不羁爱自由

#2 · 2019-02-03 09:43

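I didn't know this before, but less has the magical ability to read PDF files (this likely relies on an input preprocessor such as lesspipe being configured, so it may not work on every system). I was able to extract the table data from your example PDF with this script: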

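import re
import subprocess

# Run less non-interactively and capture the text it extracts from the PDF
raw = subprocess.check_output(["less", "BAG_15m_kzh_2012_de.pdf"])
output = raw.decode("utf-8", errors="replace")

# Data rows start with a number followed by a dot
re_data_prefix = re.compile(r"^[0-9]+[.].*$")
# A field is a run of words separated by single spaces
re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")
for line in output.splitlines():
    if re_data_prefix.match(line):
        print([l[0].strip() for l in re_data_fields.findall(line)])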
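
To cover the "iterate over all these files and write CSV" part of the question, here is a minimal sketch of how the same idea might be scaled up. It is not tested against the real files and rests on several assumptions: that poppler's `pdftotext` is installed and on the PATH (used here instead of less), that columns in its `-layout` output are separated by two or more spaces, and that data rows start with a number followed by a dot, as in the script above. The names `parse_rows`, `pdf_to_text` and `convert_directory` are made up for this example.

```python
import csv
import re
import subprocess
from pathlib import Path

# Data rows start with a number followed by a dot (same idea as the script above).
ROW_PREFIX = re.compile(r"^\s*[0-9]+[.]")
# Assumption: columns are separated by runs of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_rows(text):
    """Return the data rows found in extracted text as lists of field strings."""
    rows = []
    for line in text.splitlines():
        if ROW_PREFIX.match(line):
            rows.append(COLUMN_GAP.split(line.strip()))
    return rows


def pdf_to_text(pdf_path):
    """Extract text from one PDF with pdftotext (assumed to be installed)."""
    # "-layout" keeps the column alignment; "-" sends the text to stdout.
    raw = subprocess.check_output(["pdftotext", "-layout", str(pdf_path), "-"])
    return raw.decode("utf-8", errors="replace")


def convert_directory(pdf_dir, csv_path):
    """Write the rows of every PDF in pdf_dir to one CSV, prefixed by file name."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            for row in parse_rows(pdf_to_text(pdf_path)):
                writer.writerow([pdf_path.name] + row)
```

Calling `convert_directory("pdfs", "tables.csv")` (both paths are placeholders) would then produce a single CSV in which the first column is the source file name. If the column gaps in your PDFs turn out to be narrower than two spaces, the `COLUMN_GAP` pattern is the part that would need adjusting.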
