I'm trying to read a PDF file where each page is divided into a 3x3 grid of blocks of information, of the form
A | B | C
D | E | F
G | H | I
Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots.
I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.
    import PyPDF2

    def getPDFContent(path):
        content = ""
        # Open in binary mode; the with block closes the file when done
        with open(path, "rb") as p:
            pdf = PyPDF2.PdfFileReader(p)
            numPages = pdf.getNumPages()
            for i in range(numPages):
                content += pdf.getPage(i).extractText() + "\n"
        # Replace non-breaking spaces and collapse runs of whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
However, this only reads the file line by line. I'd like a solution that lets me read just a portion of the page, so that I could read A, then B, then C, and so on. The answer here also works fairly well, but the order of columns routinely gets distorted, and I've only gotten it to read line by line.
I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and pypdf2. If you know the size of the columns and rows in inches, you can use minecart (full disclosure: I wrote minecart). Example code:
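The core of this approach is assigning each piece of text to a cell by its position on the page. Here is a minimal sketch of that step on its own, independent of any PDF library. It assumes the text has already been extracted as (x, y, text) tuples in PDF points (1/72 inch, origin at the bottom-left of the page), and that the grid borders below match a US Letter page. The border values and the names `band_index` and `split_into_cells` are made up for illustration and are not part of minecart.

```python
# Hypothetical grid borders in points (1/72 inch), lowest value first.
# Columns at 0.5, 3, 5.5 and 8 inches from the left edge.
COLUMN_BORDERS = [36, 216, 396, 576]
# Rows at 1, 4, 7 and 10 inches from the bottom edge.
ROW_BORDERS = [72, 288, 504, 720]


def band_index(value, borders):
    """Return the index of the band containing value, or None if outside."""
    for i in range(len(borders) - 1):
        if borders[i] <= value < borders[i + 1]:
            return i
    return None


def split_into_cells(fragments):
    """Group (x, y, text) fragments into 9 strings: [A, B, C, ..., I]."""
    cells = [[] for _ in range(9)]
    for x, y, text in fragments:
        col = band_index(x, COLUMN_BORDERS)
        row_from_bottom = band_index(y, ROW_BORDERS)
        if col is None or row_from_bottom is None:
            continue  # fragment lies in the page margin
        # PDF y grows upward, so flip it to make row 0 the top row.
        row = 2 - row_from_bottom
        cells[row * 3 + col].append((x, y, text))
    result = []
    for cell in cells:
        # Reading order inside a cell: top to bottom, then left to right.
        cell.sort(key=lambda frag: (-frag[1], frag[0]))
        result.append(" ".join(text for _, _, text in cell))
    return result
```

With whatever extraction library you use, you would build the fragment list from each text object's bounding box (for example its lower-left corner) and call `split_into_cells` once per page.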