pyPDF2 TypeError when trying to extract text

2019-07-21 21:33发布

问题:

I have successfully installed pyPDF, but the extractText method does not work well, so i decided to try pyPDF2, the problem is, when extracting text there is an exception:

Traceback (most recent call last):
  File "C:\Users\Asus\Desktop\pfdtest.py", line 44, in <module>
    test2()
  File "C:\Users\Asus\Desktop\pfdtest.py", line 41, in test2
    print(mypdf.getPage(0).extractText())
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())
TypeError: initial_value must be str or None, not bytes

and this is my sample code:

filename = "myfile.pdf"
f = open(filename,'rb')
mypdf = PdfFileReader(f)
print(f,mypdf,mypdf.getNumPages())
print(mypdf.getPage(0).extractText())

It correctly determines the amount of pages in the pdf, but it has a problem with reading the stream.

回答1:

It was a problem related to the compatibility within PyPDF2 and Python 3.

In my case, I have solved it by replacing pdf.py and utils.py with the ones you will find here, where they basically control if you are running Python 3 and, in case you are, receive data as bytes instead of strings.