Python text extraction does not work on some pdfs

Posted 2019-04-02 13:39

Question:

I am trying to read a PDF from a URL. I followed several Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code (Python 2) looks like this:

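import PyPDF2
from StringIO import StringIO
from urllib2 import Request, urlopen

url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"

# Download the PDF and wrap the raw bytes in a file-like object.
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)

print pdf.getNumPages()
print pdf.getDocumentInfo()
# Page indices are zero-based, so getPage(1) is the second page.
print pdf.getPage(1).extractText()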

I am able to extract text successfully from the first link. But if I run the same program on the second PDF (the commented-out URL), I do not get any text, although the page count and document info still seem to show up.

I tried extracting text with PDFMiner from the terminal and was able to extract text from the second PDF.
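
Since PDFMiner handled the second file from the terminal, one possible workaround is to try one extractor and fall back to another when it returns nothing. The following is a minimal Python 3 sketch under stated assumptions, not a definitive fix: the helper names `first_nonempty_text`, `extract_with_pypdf2` and `extract_with_pdfminer` are hypothetical, the PyPDF2 part assumes an older release (before 3.0) that still provides `PdfFileReader`, and the PDFMiner part assumes the pdfminer.six package is installed.

```python
from io import BytesIO


def extract_with_pypdf2(data):
    """Extract text from every page with the legacy PyPDF2 API (assumed < 3.0)."""
    import PyPDF2  # imported lazily so this module loads without PyPDF2

    reader = PyPDF2.PdfFileReader(BytesIO(data))
    pages = (reader.getPage(i) for i in range(reader.getNumPages()))
    return "\n".join(page.extractText() for page in pages)


def extract_with_pdfminer(data):
    """Extract text with pdfminer.six (assumed to be installed)."""
    from pdfminer.high_level import extract_text

    return extract_text(BytesIO(data))


def first_nonempty_text(data, extractors):
    """Return the first non-blank result from the given extractor callables.

    An extractor that raises or returns only whitespace is skipped.
    Returns an empty string if none of them produces any text.
    """
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # treat a failing extractor like an empty result
        if text and text.strip():
            return text
    return ""
```

With the downloaded PDF bytes in a variable `data`, a call such as `first_nonempty_text(data, [extract_with_pypdf2, extract_with_pdfminer])` would try PyPDF2 first and only use PDFMiner if PyPDF2 fails or yields no text.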

Any idea what is wrong with the PDF, or is this a limitation of the libraries I am using?

Answer 1:

If you read the comments in the pyPDF documentation, you'll see it states right there that this functionality will not work well for some PDF files. In other words, you're looking at a limitation of the library.

Looking at the two PDF files, I can't see anything wrong with the files themselves. But...

The first file contains fully embedded fonts.
The second file contains subsetted fonts.

This means that the second file is more difficult to extract text from, and the library probably doesn't support that properly. Just for reference, I did a text extraction with callas pdfToolbox (caution, I'm affiliated with this tool), which uses the Acrobat text extraction, and the text is properly extracted from both files (confirming that the PDF files themselves are not the problem).
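
To check the same font difference on other files, it may help to know that a subsetted font in a PDF is conventionally named with a tag of six uppercase letters followed by a plus sign, for example `ABCDEF+Arial`. The sketch below is a hypothetical Python helper (`is_subset_font_name` and `subset_fonts` are not part of any library) that tests font names for that prefix. How the font names are read out of a PDF is left out here and depends on the library used.

```python
import re

# Subset fonts are conventionally named "<6 uppercase letters>+<real name>".
_SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+.+")


def is_subset_font_name(base_font):
    """Return True if a font name carries a subset tag such as 'ABCDEF+'.

    A leading slash is tolerated, since PDF name objects are usually
    printed as '/ABCDEF+Arial'.
    """
    return bool(_SUBSET_TAG.match(base_font))


def subset_fonts(font_names):
    """Return the names from font_names that look like subset fonts, in order."""
    return [name for name in font_names if is_subset_font_name(name)]
```

A name without the tag does not prove that the font is fully embedded; it only shows that the font is not marked as a subset.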