How to extract charts/tables/graphs from PDF files

2019-08-10 02:36发布

站内文章 / Python

41 0

爷、活的狠高调

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Searched quite a bit but as I couldn't find a solution for this kind of problem, hence posting a clear question on the same. Most answers cover image/text extraction which are comparatively easier.

I've a requirement of extracting tables and graphs as text (csv) and images respectively from PDFs.

Can anyone help me with an efficient python 3.6 code to solve the same?

Till now I could achieve extracting jpgs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain jpgs, hence my code fails badly in achieving that.

Example, I want to extract table from page 11 and graphs from page 12 as image or something which is feasible from the below given link. How to go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf