Rendering a PDF as an image and extracting hyperlinks

I use ImageMagick to render a PDF (generated by pdfLaTeX) as an image:

convert -density 120 test.pdf -trim test.png

Then I use this image in an HTML file (in order to include LaTeX code in my own wiki engine).

But of course, the PNG file does not retain any of the hyperlinks the PDF file contains.

Is there a way to extract the coordinates and target URLs of the hyperlinks as well, so that I can build an HTML image map?

If it makes a difference: I only need external (http://) hyperlinks, not PDF-internal ones. A text-based solution like pdftohtml would be unacceptable, since the PDFs also contain graphics and formulas.

Tags: html pdf hyperlink imagemagick

2 answers

beautiful°

Reply #2 · 2020-03-26 05:26

A colleague of mine found a nice library, PDFMiner, which includes a tool, tools/dumppdf.py, that does pretty much what I need. See http://www.unixuser.org/~euske/python/pdfminer/

There is also another SO question with an answer that covers this: "Looking for a linux PDF library to extract annotations and images from a PDF". Apparently pdf-reader for Ruby can do this too: https://github.com/yob/pdf-reader
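As a rough illustration of the PDFMiner route, here is a minimal sketch that walks each page's annotations and keeps only Link annotations carrying a URI action. It assumes the pdfminer.six package (the maintained fork of PDFMiner) is installed; the function names link_from_annotation and iter_links are hypothetical helpers invented for this example, not part of the library or of dumppdf.py.

```python
def link_from_annotation(annot, resolve=lambda obj: obj):
    """Return (rect, uri) for an external Link annotation, else None.

    `annot` is a dict-like PDF annotation dictionary; `resolve` is a
    callable that dereferences indirect objects (identity by default).
    """
    subtype = resolve(annot.get("Subtype"))
    # Assumption: pdfminer exposes PDF names as objects with a .name
    # attribute; plain strings are accepted too.
    if getattr(subtype, "name", subtype) != "Link":
        return None
    action = resolve(annot.get("A"))
    if not isinstance(action, dict):
        return None  # e.g. an internal link that only has /Dest
    uri = resolve(action.get("URI"))
    rect = resolve(annot.get("Rect"))
    if uri is None or rect is None:
        return None  # no URI action, so not an external hyperlink
    if isinstance(uri, bytes):
        uri = uri.decode("utf-8", "replace")
    return [float(resolve(v)) for v in rect], uri


def iter_links(pdf_path):
    """Yield (page_number, rect, uri) for each external link in the PDF."""
    # Imported lazily so the helper above works without pdfminer.six.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1

    with open(pdf_path, "rb") as fh:
        document = PDFDocument(PDFParser(fh))
        pages = PDFPage.create_pages(document)
        for page_number, page in enumerate(pages, start=1):
            for ref in resolve1(page.annots) or []:
                link = link_from_annotation(resolve1(ref), resolve1)
                if link is not None:
                    yield (page_number, *link)
```

Calling iter_links("test.pdf") should then yield one tuple per external hyperlink, with the rectangle given in PDF points (origin at the bottom-left of the page).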


祖国的老花朵

Reply #3 · 2020-03-26 05:30

ImageMagick uses Ghostscript to render the PDF file to an image. You could also use Ghostscript to extract the Link annotations. In fact, the PDF interpreter already does this for the benefit of the pdfwrite device, so that it can produce PDF files with the same hyperlinks as the original.

You would need to do a small amount of PostScript programming; let me know if you want more details.

In gs/Resource/Init the file pdf_main.ps contains large parts of the PDF interpreter. In there you will find this:

  /Link {
    mark exch
    dup /BS knownoget { << exch { oforce } forall >> /BS exch 3 -1 roll } if
    dup /F knownoget { /F exch 3 -1 roll } if
    dup /C knownoget { /Color exch 3 -1 roll } if
    dup /Rect knownoget { /Rect exch 3 -1 roll } if
    dup /Border knownoget {
....
    } if
    { linkdest } stopped

That code processes Link annotations (the hyperlinks in the PDF file). You could replace the 'linkdest' with PostScript code to write the data to a file instead, which would give you the hyperlinks. Note that you would also need to set -dDOPDFMARKS on the command line, as this kind of processing is usually disabled for rendering devices, which can't make use of it.
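Once the Rect values and target URLs have been written out, by whichever route, they still have to be mapped onto the rendered image. The following Python sketch shows one way this might be done. It assumes the page origin is at (0, 0) in the bottom-left corner, that the image was rendered at the given density (120 in the question), and that the pixel offset removed by -trim is known and passed in as trim_offset. The function names and the map name "pdflinks" are made up for this example.

```python
from html import escape


def pdf_rect_to_pixels(rect, page_height_pt, density=120, trim_offset=(0, 0)):
    """Convert a PDF Rect (points, origin bottom-left) to pixel
    coordinates (origin top-left) as (left, top, right, bottom)."""
    x0, y0, x1, y1 = rect
    scale = density / 72.0  # PDF user space uses 72 units per inch
    left = min(x0, x1) * scale - trim_offset[0]
    right = max(x0, x1) * scale - trim_offset[0]
    # Flip the y axis: PDF y grows upwards, image y grows downwards.
    top = (page_height_pt - max(y0, y1)) * scale - trim_offset[1]
    bottom = (page_height_pt - min(y0, y1)) * scale - trim_offset[1]
    return tuple(round(v) for v in (left, top, right, bottom))


def build_image_map(links, page_height_pt, name="pdflinks", **kwargs):
    """Build an HTML <map> element from (rect, uri) pairs."""
    areas = []
    for rect, uri in links:
        pixels = pdf_rect_to_pixels(rect, page_height_pt, **kwargs)
        coords = ",".join(str(v) for v in pixels)
        href = escape(uri, quote=True)
        areas.append(f'  <area shape="rect" coords="{coords}" href="{href}">')
    return "\n".join([f'<map name="{name}">', *areas, "</map>"])
```

The resulting map can then be attached to the PNG with usemap="#pdflinks" on the img element. Because -trim crops the image, the offset has to be determined separately for each rendered page; with an offset of (0, 0) the coordinates refer to the untrimmed rendering.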
