Extract Images and Words with coordinates and size

I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a solution for extracting both images and text (with coordinates) from a PDF.

The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image.

I know that there is no way to extract structured info from a PDF like this, but with the coordinates of all image and text objects I could write code that links text to an image by its distance from that image. Then I could split the text using a RegExp and work out what is a product code, what is an image code, etc.
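
The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Box` type, the function names, the 50-unit distance threshold and both code patterns (`IMG-001`, `AB-1234`) are made up for the example, and the bounding boxes are assumed to come from whichever extraction tool ends up being used.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def box_gap(a, b):
    """Shortest distance between two boxes; 0.0 if they touch or overlap."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return math.hypot(dx, dy)


def link_text_to_images(images, words, max_gap=50.0):
    """Attach each word to its nearest image, ignoring words farther than max_gap.

    images: dict mapping an image name to its Box
    words:  list of (text, Box) pairs
    """
    linked = {name: [] for name in images}
    if not images:
        return linked
    for text, box in words:
        name, gap = min(
            ((n, box_gap(box, image_box)) for n, image_box in images.items()),
            key=lambda pair: pair[1],
        )
        if gap <= max_gap:
            linked[name].append(text)
    return linked


# Hypothetical code formats; adjust both patterns to the real catalog.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^[A-Z]{2}-\d{4}$")


def classify(texts):
    """Split linked words into image codes and product codes by pattern."""
    return {
        "image_codes": [t for t in texts if IMAGE_CODE.match(t)],
        "product_codes": [t for t in texts if PRODUCT_CODE.match(t)],
    }
```

Measuring the gap between box edges rather than between box centers keeps a long caption next to a large image from looking farther away than it really is.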

Could you recommend a good and working solution for the task?

Tags: image pdf coordinates extraction words

3 Answers

时光不老，我们不散

#2 · 2019-02-09 17:12

If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. The IacDocument.GetObjectsInRectangle method retrieves all the graphic objects in the area you are interested in, and you can then use the ObjectType attribute to separate images from text. The library already provides an algorithm for grouping nearby text together. From the documentation:

IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.

Usual disclaimer applies.


Melony?

#3 · 2019-02-09 17:18

Several Java libraries can do this. Have you looked at JPedal or PDFBox?


甜甜的少女心

#4 · 2019-02-09 17:28

Use Xpdf (http://www.foolabs.com/xpdf/).

It can extract all the characters in the PDF with their coordinates (pdftotext -bbox [sourcefile] [outputfile]), as well as all the images and SVGs in the PDF.

It's open source (GPLv2) and supports many other extraction features as well.
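
Once the -bbox output file exists, the word boxes can be read back with a small parser. The sketch below assumes the XHTML layout that, as far as I know, Poppler's build of pdftotext writes for -bbox: `<page>` elements containing `<word>` elements with `xMin`, `yMin`, `xMax` and `yMax` attributes. Check this against the output of your own build. `BBoxWordParser`, `parse_bbox_output` and the `SAMPLE` fragment are made up for the illustration.

```python
from html.parser import HTMLParser


class BBoxWordParser(HTMLParser):
    """Collects <word> elements and their bounding boxes, page by page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._page = 0
        self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names (xMin becomes xmin).
        if tag == "page":
            self._page += 1
        elif tag == "word":
            a = dict(attrs)
            self._current = {
                "page": self._page,
                "x_min": float(a["xmin"]),
                "y_min": float(a["ymin"]),
                "x_max": float(a["xmax"]),
                "y_max": float(a["ymax"]),
                "text": "",
            }

    def handle_data(self, data):
        # Only keep character data that sits inside a <word> element.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "word" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_bbox_output(xhtml):
    """Return one dict per word: page number, bounding box and text."""
    parser = BBoxWordParser()
    parser.feed(xhtml)
    parser.close()
    return parser.words


# Hand-written fragment in the assumed format, not real tool output.
SAMPLE = """
<doc>
  <page width="595.0" height="842.0">
    <word xMin="56.8" yMin="57.2" xMax="90.1" yMax="70.5">IMG-001</word>
    <word xMin="95.0" yMin="57.2" xMax="130.4" yMax="70.5">AB-1234</word>
  </page>
</doc>
"""

if __name__ == "__main__":
    for word in parse_bbox_output(SAMPLE):
        print(word)
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the string to `parse_bbox_output`.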
