I am trying to parse object elements from a PDF file using Python's re module. My goal is to extract each PDF object with a regular expression.
An example of PDF objects is the following:
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
...
When I use "\d+\s\d+\sobj[\s,\S]*endobj"
it doesn't work (it keeps matching until the last endobj is found). How can I modify the regular expression to match each object separately (in other words, the part from 1 0 obj until endobj)?
A question mark after the repeated part ([\s\S]*? instead of [\s\S]*) makes it match the minimal number of characters. The comma is also unnecessary because \S already covers it.
If you are using only regex, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regex can't handle recursive structures, at least not with Python's re module.
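As a small illustration of both points, the sketch below runs the greedy and non-greedy patterns on the sample from the question, then on a made-up object (the variable `tricky` is a hypothetical input, not from the question) whose string happens to contain the word endobj.

```python
import re

sample = """1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
"""

# Greedy: [\s\S]* runs to the LAST endobj, so both objects end up in one match.
greedy = re.findall(r"\d+\s\d+\sobj[\s\S]*endobj", sample)

# Non-greedy: [\s\S]*? stops at the FIRST endobj after each object header.
lazy = re.findall(r"\d+\s\d+\sobj[\s\S]*?endobj", sample)

print(len(greedy))  # 1
print(len(lazy))    # 2

# Hypothetical input: a string object whose text contains "endobj".
# The non-greedy match ends inside the string, which is the kind of
# file a regex-only approach cannot handle.
tricky = "3 0 obj\n(the word endobj in a string)\nendobj\n"
print(re.findall(r"\d+\s\d+\sobj[\s\S]*?endobj", tricky))
# ['3 0 obj\n(the word endobj']
```

The non-greedy pattern is fine for simple, trusted input like the sample, but the last case shows why it is not a general solution.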
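A PDF file is a tree of objects and streams. The value types are:
Dictionaries: << (name value)* >>
Lists: [ (value)* ]
Names: / (regular char)*
Strings: ( (char)* )
Hex strings: < (hexchar)* >
Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)
Booleans: true | false
References: (digit)+ (digit)+ R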
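Because dictionaries and lists can nest, parsing the values needs recursion rather than a single pattern. Here is a minimal sketch of a recursive-descent parser for a subset of the syntax: dictionaries, lists, names, numbers, booleans, null and references. The names `TOKEN` and `parse_value` are made up for this example. It deliberately ignores strings, hex strings, streams and comments, and it silently skips characters it does not recognise, so treat it as an illustration only.

```python
import re

# Tokens for a SUBSET of PDF syntax (assumption: no strings, streams or comments).
TOKEN = re.compile(
    r"<<|>>|\[|\]"                      # dictionary and list delimiters
    r"|/[^\s/<>\[\]()%]*"               # names such as /Type
    r"|[+-]?(?:\d+\.\d*|\.\d+|\d+)"     # numbers
    r"|true|false|null|R"               # keywords
)

def parse_value(tokens, pos):
    """Parse one value starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == "<<":
        result = {}
        pos += 1
        while tokens[pos] != ">>":
            key = tokens[pos]                          # a name, e.g. "/Kids"
            value, pos = parse_value(tokens, pos + 1)  # recursion handles nesting
            result[key] = value
        return result, pos + 1
    if tok == "[":
        items = []
        pos += 1
        while tokens[pos] != "]":
            item, pos = parse_value(tokens, pos)
            items.append(item)
        return items, pos + 1
    if tok.startswith("/"):
        return tok, pos + 1
    if tok in ("true", "false"):
        return tok == "true", pos + 1
    if tok == "null":
        return None, pos + 1
    # Two unsigned integers followed by R form an indirect reference.
    if (pos + 2 < len(tokens) and tokens[pos + 2] == "R"
            and tok.isdigit() and tokens[pos + 1].isdigit()):
        return ("ref", int(tok), int(tokens[pos + 1])), pos + 3
    return (float(tok) if "." in tok else int(tok)), pos + 1

text = "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 /Resources << /Font << /F1 4 0 R >> >> >>"
tokens = TOKEN.findall(text)
value, _ = parse_value(tokens, 0)
print(value["/Kids"])                        # [('ref', 3, 0)]
print(value["/Resources"]["/Font"]["/F1"])   # ('ref', 4, 0)
```

The regex is still useful here, but only for splitting the input into tokens; the nesting is handled by `parse_value` calling itself.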
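Whitespace and comments are ignored in most places. Comments start with % and run until the end of the line.
Indirect objects are specified as an object number, a generation number, and the value wrapped in obj ... endobj, as in the question's example. The first object there can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached.
A PDF file is laid out roughly as a header, a series of indirect objects, a cross-reference table, and a trailer.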
The root of the object tree is the trailer object. Every object is referenced directly or indirectly from this dictionary.
There is a lot more complexity hidden inside the streams, but that does not affect the file structure.
The full specification can be found on Adobe's website.
You need to use *? as the non-greedy version; see the documentation for Python's re module.
Also, note that the PDF format is very complex, especially when it has binary streams within it, but if you know the PDFs you are looking at are simple then this should work.
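Since real PDF files contain binary data, one way to apply the non-greedy pattern is to read the file as bytes and use a bytes pattern. The sketch below assumes a simple file: `data` is a hypothetical in-memory stand-in for `open(path, "rb").read()`, and `OBJ` is a name made up for this example. It would still break if a stream happened to contain the bytes endobj.

```python
import re

# Hypothetical stand-in for the contents of a simple PDF read in binary mode.
data = (
    b"1 0 obj\n<< /Length 4 >>\nstream\n\x00\xff\n\x01\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)

# Bytes pattern; re.DOTALL lets "." match newlines, and ".*?" is non-greedy.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj(.*?)endobj", re.DOTALL)

objects = {}
for m in OBJ.finditer(data):
    obj_id = (int(m.group(1)), int(m.group(2)))  # (object number, generation)
    objects[obj_id] = m.group(3).strip()

print(sorted(objects))  # [(1, 0), (2, 0)]
print(objects[(2, 0)])  # b'<< /Type /Pages >>'
```

Working on bytes avoids decoding errors on binary stream content, which a text-mode read would likely hit.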
Not exactly an answer to your question, but you might want to have a look at existing PDF parsing libraries in Python, for example pdfminer or pyPdf. Even if you do not end up using them, you might as well see how they do it.