I have several PDFs that were generated with Microsoft Word. I want to:
- Use a regex to find matches in the PDF text.
- Convert the matching text to a link that points to an external URL.
- Save the new version of the PDF.
If I were doing this in HTML, it would look like this:
<!-- before: -->
This is the text to match.
<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.
How can I do this to a PDF?
I'd prefer Python, but I'm open to alternatives.
Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).
Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink()
, but that adds internal links in the PDF, not links to external URLs.
I have solved this. Appreciate anyone cleaning up any errors. https://github.com/JohnMulligan/PyPDF2/tree/URI-linking
Because Kurt answered most of parts 1 and 2, I'm going to restrict my answer to part 3 of the original question: how to add external links to a PDF. (I have a fully working answer to 1 & 2, but it's inelegant. If people want it, I'll post that, too.)
My branch of PyPDF2 has
addURI
functionality, that works in the same way as the package's originaladdLink()
.Specifically: With a rectangles dictionary that has has pagenumber keys:
(Rectangle format is
[llX, llY, urX, urY]
) Now we have rectangles to assign 2 rectangles to page 1, and 1 rectangle to page 2.Add a URLs dictionary that uses those keys to look up the URLs to assign:
We can then add the appropriate links to all those rectangle zones:
Cleaner ways to do the first half there with JSON or something but for my implementation it was the most efficient way.
The key line of course is this one:
With
pagenum
asint()
, destination asstr()
, and rectangle aslist()
1. 'regex' approach won't work!
What you 'want', ('use regex to find matches in PDF') is not possible! Plain and simple answer.
Reasons:
For the general case, you cannot use regexes in order to find 'matches' in a PDF text. And I will not even talk about Unicode characters here...
I'll only consider the simple string of text from the example in your question:
match
.In PDF source code, this string could be present in different incarnations, depending on the PDF generating software as well as on the exact font with font encoding being used. The following listing is not complete!
It gets more complicated even, if the font the string should be using does use a custom encoding (as is the case when the font is embedded in the PDF as a subset -- only containing these glyphs which are used in the respective text).
This could mean that what was
<6d61746365> Tj
above could become<2234567111> Tj
with the custom encoded font, but it will still displaymatch
on the PDF page.2. Workarounds to achieve similar results may work
You can use
pdftotext -layout some.pdf some.txt
to create a file containing the text from your PDF. (This does not work reliably. Some PDFs, for example those which are missing a valid/ToUnicode
table, will not lend themselves readily to text extraction.)This can lead you to the page number for a match.
Using (with some trial'n'error)
pdftotext -f 33 -l 33 -layout -x NN -y MM -W NN -H MM
can narrow down the location of your match on page 33 more exactly.Using
pdftotext -layout -bbox -f 33 -l 33
will return the coordinates of the bounding boxes for each word on page 33.You could use TET, the Text Extraction Toolkit to find the exact coordinates of matching words too. TET can give you the coordinates of individual glyphs even.
Once you have identified the locations of your matches, you may be able to employ PDFlib to add your links.
As PDF is a binary format, regexes are not the right approach to this problem. You need to use a python pdf library that can read and write pdf files. A starting point could be this SO question.