Parsing a PDF file using regular expressions in Python

I am trying to parse some object elements from a PDF file using Python's re module. My goal is to extract each PDF object with a regular expression. An example of the PDF objects is the following:

1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
...

When I use "\d+\s\d+\sobj[\s,\S]*endobj" it doesn't work: it keeps matching until the last endobj is found. How can I modify the regular expression to parse each object separately (in other words, the part from 1 0 obj until the next endobj)?

Tags: python regex parsing pdf

4 answers

▲ chillily

#2 · 2019-06-27 16:33

A question mark after the quantifier makes it non-greedy, so it matches as few characters as possible. The comma is also unnecessary, because \S already matches it.

\d+\s\d+\sobj[\s\S]*?endobj
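
As a rough sketch of how this pattern could be used, the snippet below runs it over the sample from the question. The capture groups, the use of \s+ instead of a single \s, and the helper name split_objects are additions for illustration only, and it assumes the file is simple enough that "endobj" never appears inside an object body.

```python
import re

# Sample taken from the question: two small indirect objects.
PDF_TEXT = """1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
"""

# [\s\S]*? matches any character, newlines included, as few times as possible,
# so each match stops at the first "endobj" that follows "obj".
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj([\s\S]*?)endobj")


def split_objects(text):
    """Return (object number, generation, body) for each indirect object found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(text)
    ]


objects = split_objects(PDF_TEXT)
for num, gen, body in objects:
    print(num, gen, repr(body))
```

With the greedy [\s\S]* the same input would produce a single match running from "1 0 obj" to the last "endobj".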


Melony?

#3 · 2019-06-27 16:40

If you are using only regex, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regular expressions cannot handle recursive structures, at least not with Python's re module.

A PDF file is a tree of objects and streams:

Dictionaries: << (name value)* >>
Lists: [ (value)* ]
Names: / (regular char)*
Strings: ( (char)* )
Hex strings: < (hexchar)* >
Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)
Booleans: true | false
References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R

Whitespace and comments are ignored in most places. Comments start with % and run until the end of the line.
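
The token types listed above could be recognised with a small lexer along these lines. This is only a sketch: the names TOKEN_RE and tokenize are made up for illustration, strings with nested parentheses are not handled, and turning the flat token list into nested dictionaries and lists would still need a recursive parser on top.

```python
import re

# One alternative per token type from the list above.
# Order matters: references before numbers, "<<" before hex strings.
TOKEN_RE = re.compile(r"""
    (?P<comment>%[^\r\n]*)
  | (?P<ref>\d+\s+\d+\s+R\b)
  | (?P<number>-?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<bool>\b(?:true|false)\b)
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\((?:\\.|[^\\()])*\))
  | (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<hex><[0-9A-Fa-f\s]*>)
  | (?P<list_open>\[) | (?P<list_close>\])
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs, skipping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "comment":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


sample = "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >> % trailing comment"
for kind, value in tokenize(sample):
    print(kind, value)
```

The lexer itself is regular, which is why a regex is adequate for it; the nesting of << >> and [ ] is the part that a single regular expression cannot follow.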

Indirect objects are specified as:

1 0 obj
(any object)
endobj

This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:

1 0 obj
<<
/Length 22
>>
stream
(22 bytes of raw data)
endstream
endobj
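
Streams are one place where the non-greedy regex can go wrong, because the raw data may itself contain the text "endobj". The sketch below builds a hypothetical object to show this, then reads the stream by its /Length instead of searching for a terminator. It assumes /Length is a direct integer (in real files it may be an indirect reference) and that the stream keyword is followed by a single line feed; read_stream is an illustrative name, not part of any library.

```python
import re

# Hypothetical object whose stream data happens to contain the text "endobj".
payload = b"not the endobj marker"
data = (
    b"1 0 obj\n<<\n/Length " + str(len(payload)).encode() + b"\n>>\nstream\n"
    + payload + b"\nendstream\nendobj\n"
)

# The non-greedy match stops at the first "endobj", which is inside the stream.
naive = re.search(rb"\d+\s+\d+\s+obj[\s\S]*?endobj", data)
truncated = naive.group(0)


def read_stream(buf):
    """Slice out the stream bytes using the /Length entry of the dictionary."""
    length = int(re.search(rb"/Length\s+(\d+)", buf).group(1))
    start = buf.index(b"stream\n") + len(b"stream\n")
    return buf[start:start + length]


print(truncated)
print(read_stream(data))
```

Here the naive match ends before "endstream", so the object is cut short, while the length-based read returns the full payload.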

A PDF file looks something like this:

%PDF-1.4
%ÿÿÿÿ
1 0 obj
<< /Author (MizardX) >>
endobj
2 0 obj
<<
/Type /Catalog
% more required keys
>>
endobj
%lots of more indirect objects, one after another
trailer
<<
/Info 1 0 R
/Root 2 0 R
% ... more required keys
>>
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
startxref
225
%%EOF

The root of the object tree is the trailer object. Every object is referenced directly or indirectly from this dictionary.

There is a lot more complexity hidden inside the streams, but that does not affect the file structure.
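
For a simple file like the one above, the entry points could be pulled out roughly as follows. This is a sketch under strong assumptions: the trailer dictionary is flat (no nested dictionaries), there is only one trailer, and the sample bytes are an abridged copy of the example. The helper names trailer_refs and startxref_offset are invented for illustration.

```python
import re

# Abridged tail of the example file above.
TAIL = b"""trailer
<<
/Info 1 0 R
/Root 2 0 R
>>
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
startxref
225
%%EOF
"""


def trailer_refs(buf):
    """Map trailer keys to (object number, generation) for entries that are references."""
    start = buf.rindex(b"trailer")
    end = buf.index(b">>", start)  # assumes no nested dictionary in the trailer
    body = buf[start:end]
    return {
        m.group(1).decode(): (int(m.group(2)), int(m.group(3)))
        for m in re.finditer(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R", body)
    }


def startxref_offset(buf):
    """Return the number that follows the startxref keyword near the end of the file."""
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF", buf)
    return int(m.group(1))


print(trailer_refs(TAIL))
print(startxref_offset(TAIL))
```

The value after startxref is the byte offset of the cross-reference table, which in turn lists the offset of each indirect object, so a reader can jump to objects directly instead of scanning the whole file.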

The full specification can be found at Adobe's website.


虎瘦雄心在

#4 · 2019-06-27 16:46

You need to use *? as the non-greedy version of the quantifier; see the Python re documentation.

Also, note that the PDF format is very complex, especially once it contains binary streams, but if you know the PDFs you are looking at are simple then this should work.


我欲成王，谁敢阻挡

#5 · 2019-06-27 16:47

Not exactly an answer to your question, but you might want to have a look at existing PDF parsing libraries in Python, for example pdfminer or pyPdf. Even if you do not end up using them, you can still look at how they do it.
