I am writing a script to extract some data from a PDF. The PDF has a complicated multi-column layout, so I decided to crop each column with pyPdf and concatenate the cropped pages into a new PDF that is easier to parse. This is my code:
for i in range(numPages):
    page1 = input1.getPage(i)
    page1.trimBox.lowerLeft = (0, 550)
    page1.trimBox.upperRight = (480, 842)
    page1.cropBox.lowerLeft = (0, 550)
    page1.cropBox.upperRight = (480, 842)
    output.addPage(page1)
    page2 = input2.getPage(i)
    print page1.mediaBox.getUpperRight_x(), page1.mediaBox.getUpperRight_y()
    page2.trimBox.lowerLeft = (0, 280)
    page2.trimBox.upperRight = (480, 550)
    page2.cropBox.lowerLeft = (0, 280)
    page2.cropBox.upperRight = (480, 550)
    output.addPage(page2)
    page3 = input3.getPage(i)
    page3.trimBox.lowerLeft = (0, 0)
    page3.trimBox.upperRight = (480, 280)
    page3.cropBox.lowerLeft = (0, 0)
    page3.cropBox.upperRight = (480, 280)
    output.addPage(page3)
outputStream = file("out.pdf", "wb")
output.write(outputStream)
outputStream.close()
I then send this PDF to a PHP server, which parses it and extracts the text. Unfortunately, that did not help. Setting cropBox turned out to change only the viewable part of each page: the rest of the content is still in the file, it just cannot be seen. When I processed the new PDF with PHP, I got the same results as before. My question is: is there a way to make cropBox really crop the page, so that everything outside the box is discarded?
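To make clear what I mean by "really crop", here is a minimal sketch of the behaviour I am after. It is not pyPdf code. It assumes a parser that can report an (x, y) position for each text fragment, which I do not know how to get from pyPdf or from the PHP side. The fragment data and the helper names `text_in_box` and `text_by_box` are made up for illustration; only the three boxes come from my script.

```python
# Hypothetical data: (x, y, text) fragments as a position-aware parser might
# report them. The coordinates and strings are invented for this example.
FRAGMENTS = [
    (20, 800, "top region, first line"),
    (20, 600, "top region, second line"),
    (20, 400, "middle region"),
    (20, 100, "bottom region"),
    (500, 700, "outside every box"),
]

# The three boxes from my script, as (x0, y0, x1, y1) in PDF points.
BOXES = [
    (0, 550, 480, 842),
    (0, 280, 480, 550),
    (0, 0, 480, 280),
]


def text_in_box(fragments, box):
    """Return the text of the fragments whose origin lies inside box."""
    x0, y0, x1, y1 = box
    # Half-open ranges, so a fragment on a shared edge lands in one box only.
    inside = [f for f in fragments if x0 <= f[0] < x1 and y0 <= f[1] < y1]
    # PDF y grows upward, so descending y gives top-to-bottom reading order.
    inside.sort(key=lambda f: (-f[1], f[0]))
    return [f[2] for f in inside]


def text_by_box(fragments, boxes):
    """Concatenate the per-box text in the order the boxes are given."""
    lines = []
    for box in boxes:
        lines.extend(text_in_box(fragments, box))
    return lines
```

With the sample data, `text_by_box(FRAGMENTS, BOXES)` keeps the four fragments that fall inside the boxes, in box order, and drops the one at x = 500. That is the result I expected cropBox to give me: content outside the box should not reach the parser at all.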