Parse a PDF document with Ruby

Posted 2019-02-22 03:10

I have multiple PDF documents in a folder that have a certain structure:

[image not preserved]

Now I want to be able to parse the information from the PDF. Please note that the paragraphs have varying lengths.

Obviously I am not asking you to solve the problem for me, but I do need some pointers as to how this can be achieved.

I have used Nokogiri before; essentially I need something like that, but for PDFs.

So the pseudo result for my example would look like this:

- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- TitleB
  - [...]

Tags: ruby parsing pdf scripting ocr

2 Answers

贼婆χ

Reply #2 · 2019-02-22 03:45

pdf-reader is one option, and I have used it, but it sometimes fails to return the text in the proper layout.

I would suggest using docsplit instead. You will find more information about 'pdf-reader' and 'docsplit' in this blog post.

Hope this helps. In case any clarification is required, feel free to comment.


傲

Reply #3 · 2019-02-22 03:53

Getting the text

The text can be extracted page by page like so:

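# gem install pdf-reader
require 'pdf-reader'

# Open the document; 'my.pdf' is a placeholder path
reader = PDF::Reader.new('my.pdf')

# Print the extracted text of each page, in page order
reader.pages.each do |page|
  puts page.text
end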
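Since the question mentions a whole folder of PDFs, here is a rough sketch of how the loop above might be applied to every file in a directory. The names extract_all and PDF_READER_EXTRACTOR are made up for this example, and the extractor is passed in as a lambda so the folder logic can be tried without a real PDF. Treat it as a starting point, not a finished solution.

```ruby
# Hypothetical helper (not part of pdf-reader): run an extractor over every
# *.pdf file in a folder and collect the results keyed by file name.
def extract_all(folder, extractor)
  paths = Dir.glob(File.join(folder, '*.pdf')).sort
  paths.each_with_object({}) do |path, results|
    results[File.basename(path)] = extractor.call(path)
  end
end

# One possible extractor, using the same pdf-reader calls as above.
# The gem is only loaded when the lambda is actually called.
PDF_READER_EXTRACTOR = lambda do |path|
  require 'pdf-reader'
  PDF::Reader.new(path).pages.map(&:text).join("\n")
end

# Usage (assumes a folder named 'pdfs' exists):
#   texts = extract_all('pdfs', PDF_READER_EXTRACTOR)
#   texts.each { |name, text| puts "#{name}: #{text.length} characters" }
```

Keeping the extractor separate also makes it easy to swap in a different tool later if pdf-reader does not handle your layout well.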

Saving the image

This can be done with the same library. See the example script examples/extract_images.rb.

However

This is not (yet) a complete answer. The next steps would be to:

- Parse the text and look for the headings.
- Crop the image, which can be achieved with a library like RMagick or MiniMagick.
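The heading step could look roughly like the sketch below. It relies on assumptions the question does not confirm: that the first non-blank line of the extracted text is the title, that each subtitle sits alone on its own line, and that you know the subtitle names in advance. split_into_sections is a hypothetical helper name, and real pdf-reader output may need extra cleanup before this works.

```ruby
# Hypothetical helper: split extracted text into a hash of sections.
# Assumptions: the first non-blank line is the title, and every known
# heading appears alone on its own line.
def split_into_sections(text, headings)
  lines = text.lines.map(&:strip).reject(&:empty?)
  result = { 'Title' => lines.shift }
  current = nil

  lines.each do |line|
    if headings.include?(line)
      # Start a new section for this heading
      current = line
      result[current] = +''
    elsif current
      # Join wrapped paragraph lines with a single space
      result[current] << ' ' unless result[current].empty?
      result[current] << line
    end
    # Lines between the title and the first heading are ignored
  end

  result
end

sample = "ItemA\n\nSubtitle1\nContent for\nsubtitle 1\n\nSubtitle2\nContent for subtitle 2\n"
sections = split_into_sections(sample, %w[Subtitle1 Subtitle2])
sections.each { |name, content| puts "#{name}: #{content}" }
```

Because paragraphs of varying length are simply collected until the next heading, the differing paragraph lengths mentioned in the question should not matter for this approach.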
