Add links in PDF [closed]

2020-06-06 07:27发布

I have several PDFs that were generated with Microsoft Word. I want to:

  1. Use a regex to find matches in the PDF text.
  2. Convert the matching text to a link that points to an external URL.
  3. Save the new version of the PDF.

If I were doing this in HTML, it would look like this:

<!-- before: -->
This is the text to match.

<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.

How can I do this to a PDF?

I'd prefer Python, but I'm open to alternatives.

Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).

Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink(), but that adds internal links in the PDF, not links to external URLs.

3条回答
▲ chillily
2楼-- · 2020-06-06 07:57

I have solved this. Appreciate anyone cleaning up any errors. https://github.com/JohnMulligan/PyPDF2/tree/URI-linking

Because Kurt answered most of parts 1 and 2, I'm going to restrict my answer to part 3 of the original question: how to add external links to a PDF. (I have a fully working answer to 1 & 2, but it's inelegant. If people want it, I'll post that, too.)

My branch of PyPDF2 has addURI functionality, that works in the same way as the package's original addLink().

Specifically: With a rectangles dictionary that has has pagenumber keys:

rectangles_dictionary = {0:{'key1':[255, 156, 277, 171],'key2':[293, 446, 323, 461]},1:{'key2':[411, 404, 443, 419]}}

(Rectangle format is [llX, llY, urX, urY]) Now we have rectangles to assign 2 rectangles to page 1, and 1 rectangle to page 2.

Add a URLs dictionary that uses those keys to look up the URLs to assign:

destinations_dictionary = {'key1':'url1','key2':'url2'}

We can then add the appropriate links to all those rectangle zones:

def make_pdf(rectangles_dictionary,destinations_dictionary):
    input = reader(file('pdfs/input_pdf.pdf','rb'))
    output = file('pdfs/output_pdf.pdf','wb')
    result = writer()

    for pagenum in range(0, input.getNumPages()):
        page = input.getPage(pagenum)
        result.addPage(page)

    for pagenum in rectangles_dictionary.keys():

        for name in rectangles_dictionary[pagenum].keys():
            for rectangle in rectangles_dictionary[pagenum][name]:

                    destination = destinations_dictionary[name]
                    result.addURI(pagenum, destination, rectangle)

    result.write(output)

Cleaner ways to do the first half there with JSON or something but for my implementation it was the most efficient way.

The key line of course is this one:

result.addURI(pagenum, destination, rectangle)

With pagenum as int(), destination as str(), and rectangle as list()

查看更多
我命由我不由天
3楼-- · 2020-06-06 08:03

1. 'regex' approach won't work!

What you 'want', ('use regex to find matches in PDF') is not possible! Plain and simple answer.

Reasons:

For the general case, you cannot use regexes in order to find 'matches' in a PDF text. And I will not even talk about Unicode characters here...

I'll only consider the simple string of text from the example in your question: match.

In PDF source code, this string could be present in different incarnations, depending on the PDF generating software as well as on the exact font with font encoding being used. The following listing is not complete!

(match) Tj                       # you are lucky
<6d61746365> Tj                  # hex representation of characters
<6d 61 74 63 65> Tj              # hex representation of characters, v2
<6d   61 7463   65> Tj           # hex representation of characters, v3
<6d>Tj <61>   Tj<746365>Tj       # hex representation of characters, v4
....                             # skipping version 5-500000000 of all... 
                                         # ...possible hex representations
(\155\141\164\143\150) Tj        # octal representation of characters
(m\141\164ch) Tj                 # octal/ascii mixed representation of chars
(\155a\164ch) Tj                 # octal/ascii mixed representation of chars, v3
<6d 61>Tj (\164c\150) Tj         # hex/octal/ascii mix
....                             # skipping many more possibilities

It gets more complicated even, if the font the string should be using does use a custom encoding (as is the case when the font is embedded in the PDF as a subset -- only containing these glyphs which are used in the respective text).

This could mean that what was <6d61746365> Tj above could become <2234567111> Tj with the custom encoded font, but it will still display match on the PDF page.


2. Workarounds to achieve similar results may work

  1. You can use pdftotext -layout some.pdf some.txt to create a file containing the text from your PDF. (This does not work reliably. Some PDFs, for example those which are missing a valid /ToUnicode table, will not lend themselves readily to text extraction.)

    This can lead you to the page number for a match.

    Using (with some trial'n'error) pdftotext -f 33 -l 33 -layout -x NN -y MM -W NN -H MM can narrow down the location of your match on page 33 more exactly.

    Using pdftotext -layout -bbox -f 33 -l 33 will return the coordinates of the bounding boxes for each word on page 33.

  2. You could use TET, the Text Extraction Toolkit to find the exact coordinates of matching words too. TET can give you the coordinates of individual glyphs even.

  3. Once you have identified the locations of your matches, you may be able to employ PDFlib to add your links.

查看更多
趁早两清
4楼-- · 2020-06-06 08:05

As PDF is a binary format, regexes are not the right approach to this problem. You need to use a python pdf library that can read and write pdf files. A starting point could be this SO question.

查看更多
登录 后发表回答