How to extract text from a PDF in Python 3.7.3

Posted 2020-06-22 21:50

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record monthly spending. For now I am focusing only on extracting the text from the PDF file, but I don't know how to do it.

What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?

I have tried using PyPDF2, but every time I try to extract text from any page using extractText(), it returns an empty string. I have also tried installing textract, but I get errors, I think because it needs additional libraries.

import PyPDF2

pdfFileObj = open("January2019.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

This prints empty strings when it should print the contents of the page.

7 answers
太酷不给撩
#2 · 2020-06-22 22:26

I tried many methods, including PyPDF2 and Tika, but they all failed. I finally found the pdfplumber module, which works for me; you can try it too.

Hope this will be helpful to you.

import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
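
To get every page rather than only the first, the same idea can be sketched as below. This is a minimal sketch, not the only way to do it: the helper names join_page_texts and extract_all_text are made up for illustration, and the "or" fallback assumes that extract_text() may return None for a page where no text is found.

```python
def join_page_texts(page_texts):
    """Join per-page text into one string, treating None (no text found) as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_all_text(pdf_path):
    """Return the text of every page of the PDF at pdf_path as a single string."""
    import pdfplumber  # third-party: pip install pdfplumber

    # The with-block closes the file even if extraction raises an error
    with pdfplumber.open(pdf_path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Calling extract_all_text('pdffile.pdf') should then give the whole document as one string, with a newline between pages.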
手持菜刀,她持情操
#3 · 2020-06-22 22:35

PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too. It says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

  1. You could instead install and use pdfminer:

    pip install pdfminer

  2. Or you can use another open-source utility named pdftotext, from xpdfreader. Instructions for using the utility are given on the page.

You can download the command-line tools from here and run the pdftotext.exe utility using subprocess. A detailed explanation of using subprocess is given here.
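
Both options can be sketched in Python roughly as follows. This is only a sketch under a few assumptions: the function names are hypothetical, the pdftotext executable is assumed to be on your PATH (on Windows, pass the path to pdftotext.exe instead), and the pdfminer variant assumes the pdfminer.six fork (pip install pdfminer.six), which provides the high-level extract_text function.

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list for the pdftotext command-line tool.

    -layout keeps the physical layout of the page, -enc UTF-8 sets the
    output encoding, and "-" as the output file sends the text to stdout
    instead of writing a .txt file.
    """
    return [executable, "-layout", "-enc", "UTF-8", pdf_path, "-"]


def extract_with_pdftotext(pdf_path, executable="pdftotext"):
    """Run pdftotext as a subprocess and return its output as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_pdfminer(pdf_path):
    """Return the text of the whole PDF using pdfminer.six (assumed installed)."""
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)
```

For example, extract_with_pdftotext('January2019.pdf') or extract_with_pdfminer('January2019.pdf') should each return the statement text as one string, provided the corresponding tool is installed.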

\"骚年 ilove
#4 · 2020-06-22 22:42

Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

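from pdfreader import SimplePDFViewer, PageDoesNotExist

# Replace with the path to your PDF file
your_pdf_file_name = "January2019.pdf"

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        # Render the current page, then collect its text
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        # Move to the next page; raises PageDoesNotExist after the last one
        viewer.next()
except PageDoesNotExist:
    pass
finally:
    fd.close()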

狗以群分
#5 · 2020-06-22 22:43

PyPDF2 does not read every PDF correctly. Try the pdftotext module instead:

    import pdftotext

    # Open the file in binary mode and load it
    with open("January2019.pdf", 'rb') as pdfFileObj:
        pdf = pdftotext.PDF(pdfFileObj)

    # Iterate over all the pages
    for page in pdf:
        print(page)
倾城 Initia
#6 · 2020-06-22 22:44
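import PyPDF2

# Open the statement in binary mode; the with-block closes it afterwards
with open('January2019.pdf', 'rb') as pdf_file:
    pdfReader = PyPDF2.PdfFileReader(pdf_file)
    count = pdfReader.numPages
    # Print the text of every page
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())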
▲ chillily
#7 · 2020-06-22 22:47

Using tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This made it really easy to separate each line of the bank statement into a list.
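
A slightly more defensive version of the same idea might look like the sketch below. The helper names content_to_lines and statement_lines are hypothetical, and the None check assumes that tika can return no 'content' at all when it finds no text (for example in a scanned PDF). Note that the tika package also needs a Java runtime.

```python
def content_to_lines(content):
    """Split extracted text into stripped, non-empty lines."""
    if not content:  # None or an empty string: nothing was extracted
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]


def statement_lines(pdf_path):
    """Parse the PDF with tika and return its non-empty lines as a list."""
    from tika import parser  # third-party: pip install tika

    parsed = parser.from_file(pdf_path)
    return content_to_lines(parsed.get("content"))
```

With this, statement_lines('January2019.pdf') should return an empty list instead of raising an error when no text could be extracted.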
