I have multiple PDF documents in a folder that have a certain structure:
Now I want to be able to parse the information from these PDFs. Please note that the paragraphs have varying lengths.
Obviously I am not asking you to solve the problem for me, but I do need some pointers as to how this can be achieved.
I have used Nokogiri before, and technically I need something like that, but for PDFs.
So the pseudo result for my example would look like this:
- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- TitleB
  - [...]
pdf-reader is one option, and I have used it. However, it has issues: sometimes it doesn't return the text in the proper format.
I would suggest using docsplit instead. You will find more information about 'pdf-reader' and 'docsplit' in this blog post.
Hope this helps. In case any clarification is required, feel free to comment.
Getting the text
The text can easily be parsed like so:
# gem install pdf-reader
require 'pdf-reader'

reader = PDF::Reader.new('my.pdf')
reader.pages.each do |page|
  puts page.text
end
Saving the image
This can be done with the same library. See the example script examples/extract_images.rb.
However
This is (not yet) a complete answer.
The next steps would now be to:
- Parse the text and look for the headings
- Crop the image, which can be achieved with a library like RMagick or MiniMagick.
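For the first of these steps, here is a minimal sketch in plain Ruby. It assumes that the first non-blank line of the extracted text is the item title and that each heading stands alone on its own line; the real layout of your PDFs may differ. The labels in HEADINGS and the method name parse_item are hypothetical and only there for illustration.

```ruby
# Hypothetical heading labels; replace them with the ones used in your PDFs.
HEADINGS = %w[Subtitle1 Subtitle2 Subtitle3].freeze

# Turns the extracted text of one item into a hash of heading => paragraph.
# Assumption: the first non-blank line is the title and every heading
# appears alone on its own line.
def parse_item(text, headings = HEADINGS)
  lines = text.lines.map(&:strip).reject(&:empty?)
  item = { 'Title' => lines.shift }
  current = nil

  lines.each do |line|
    if headings.include?(line)
      # A known heading starts a new section.
      current = line
      item[current] = ''
    elsif current
      # Paragraphs vary in length, so keep appending wrapped lines
      # until the next heading shows up.
      item[current] = [item[current], line].reject(&:empty?).join(' ')
    end
  end

  item
end

sample = <<~TEXT
  ItemA

  Subtitle1
  Content for
  subtitle 1

  Subtitle2
  Content for subtitle 2
TEXT

p parse_item(sample)
```

In practice you would pass page.text from the snippet above instead of the sample string. If the headings are not known in advance, matching on a pattern (for example, short lines without a trailing period) may work better than a fixed list.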