I want to be able to convert PDFs to CSV files and have found several useful scripts but, being new to Python, I have a question:
Where do you specify the filepath of the PDF and the CSV you want to print to?
I'm using Python 2.7.11 and PDFMiner 20140328.

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO


def pdfparser(data):
    # data is the path to the input PDF
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Render every page into the buffer
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)

    # Read the buffer once, after all pages have been processed
    data = retstr.getvalue()
    device.close()
    fp.close()

    print data  # output goes to stdout


if __name__ == '__main__':
    # The PDF path is the first command-line argument
    pdfparser(sys.argv[1])

Here is some modified code from this SO answer written by tgray:
The main difference between the answer in the link and this one is the line_creator method, which tries to extract some structure out of the PDF.
It should work with PDFMiner 20140328.
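
On the original question of where the paths go: in the question's script the PDF path is `sys.argv[1]`, and there is no CSV path at all, because `print data` sends everything to stdout. The following is a minimal sketch of one way to take both paths from the command line. It is not the tgray-derived code mentioned above. The names `open_csv_for_writing`, `text_to_csv` and `main` are made up for illustration, and splitting each line on whitespace is a naive assumption about the layout that will not suit every PDF.

```python
import csv
import sys


def open_csv_for_writing(path):
    # The csv module wants binary mode on Python 2 and text mode
    # with newline='' on Python 3.
    if sys.version_info[0] < 3:
        return open(path, 'wb')
    return open(path, 'w', newline='')


def text_to_csv(text, csv_path):
    """Write one CSV row per non-empty line of text; return the row count.

    Assumption: cells are separated by whitespace in the extracted text.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    out = open_csv_for_writing(csv_path)
    try:
        csv.writer(out).writerows(rows)
    finally:
        out.close()
    return len(rows)


def main(argv, extract_text):
    """argv[1] is the input PDF path, argv[2] is the output CSV path.

    extract_text is any function that maps a PDF path to a text string,
    for example the question's pdfparser() changed to `return data`
    instead of `print data`.
    """
    if len(argv) != 3:
        sys.stderr.write('usage: %s input.pdf output.csv\n' % argv[0])
        return 1
    text_to_csv(extract_text(argv[1]), argv[2])
    return 0
```

With `pdfparser` changed to return its text, the `__main__` block could become `sys.exit(main(sys.argv, pdfparser))`, and the script would then be run as `python yourscript.py /path/to/input.pdf /path/to/output.csv` (the script name is a placeholder). If you keep the script exactly as posted, shell redirection also works: `python yourscript.py input.pdf > output.txt`.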