Which are the best Python modules to convert PDF files into text?
问题:
回答1:
Try PDFMiner. It can extract text from PDF files as HTML, SGML or \"Tagged PDF\" format.
http://www.unixuser.org/~euske/python/pdfminer/index.html
The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
A Python 3 version is available under:
- https://github.com/pdfminer/pdfminer.six
回答2:
The PDFMiner package has changed since codeape posted.
EDIT (again):
PDFMiner has been updated again in version 20100213
You can check the version you have installed with the following:
>>> import pdfminer
>>> pdfminer.__version__
\'20100213\'
Here\'s the updated version (with comments on what I changed/added):
def pdf_to_csv(filename):
from cStringIO import StringIO #<-- added so you can copy/paste this to try it
from pdfminer.converter import LTTextItem, TextConverter
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
class CsvConverter(TextConverter):
def __init__(self, *args, **kwargs):
TextConverter.__init__(self, *args, **kwargs)
def end_page(self, i):
from collections import defaultdict
lines = defaultdict(lambda : {})
for child in self.cur_item.objs:
if isinstance(child, LTTextItem):
(_,_,x,y) = child.bbox #<-- changed
line = lines[int(-y)]
line[x] = child.text.encode(self.codec) #<-- changed
for y in sorted(lines.keys()):
line = lines[y]
self.outfp.write(\";\".join(line[x] for x in sorted(line.keys())))
self.outfp.write(\"\\n\")
# ... the following part of the code is a remix of the
# convert() function in the pdfminer/tools/pdf2text module
rsrc = PDFResourceManager()
outfp = StringIO()
device = CsvConverter(rsrc, outfp, codec=\"utf-8\") #<-- changed
# becuase my test documents are utf-8 (note: utf-8 is the default codec)
doc = PDFDocument()
fp = open(filename, \'rb\')
parser = PDFParser(fp) #<-- changed
parser.set_document(doc) #<-- added
doc.set_parser(parser) #<-- added
doc.initialize(\'\')
interpreter = PDFPageInterpreter(rsrc, device)
for i, page in enumerate(doc.get_pages()):
outfp.write(\"START PAGE %d\\n\" % i)
interpreter.process_page(page)
outfp.write(\"END PAGE %d\\n\" % i)
device.close()
fp.close()
return outfp.getvalue()
Edit (yet again):
Here is an update for the latest version in pypi, 20100619p1
. In short I replaced LTTextItem
with LTChar
and passed an instance of LAParams to the CsvConverter constructor.
def pdf_to_csv(filename):
from cStringIO import StringIO
from pdfminer.converter import LTChar, TextConverter #<-- changed
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
class CsvConverter(TextConverter):
def __init__(self, *args, **kwargs):
TextConverter.__init__(self, *args, **kwargs)
def end_page(self, i):
from collections import defaultdict
lines = defaultdict(lambda : {})
for child in self.cur_item.objs:
if isinstance(child, LTChar): #<-- changed
(_,_,x,y) = child.bbox
line = lines[int(-y)]
line[x] = child.text.encode(self.codec)
for y in sorted(lines.keys()):
line = lines[y]
self.outfp.write(\";\".join(line[x] for x in sorted(line.keys())))
self.outfp.write(\"\\n\")
# ... the following part of the code is a remix of the
# convert() function in the pdfminer/tools/pdf2text module
rsrc = PDFResourceManager()
outfp = StringIO()
device = CsvConverter(rsrc, outfp, codec=\"utf-8\", laparams=LAParams()) #<-- changed
# becuase my test documents are utf-8 (note: utf-8 is the default codec)
doc = PDFDocument()
fp = open(filename, \'rb\')
parser = PDFParser(fp)
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize(\'\')
interpreter = PDFPageInterpreter(rsrc, device)
for i, page in enumerate(doc.get_pages()):
outfp.write(\"START PAGE %d\\n\" % i)
if page is not None:
interpreter.process_page(page)
outfp.write(\"END PAGE %d\\n\" % i)
device.close()
fp.close()
return outfp.getvalue()
EDIT (one more time):
Updated for version 20110515
(thanks to Oeufcoque Penteano!):
def pdf_to_csv(filename):
from cStringIO import StringIO
from pdfminer.converter import LTChar, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
class CsvConverter(TextConverter):
def __init__(self, *args, **kwargs):
TextConverter.__init__(self, *args, **kwargs)
def end_page(self, i):
from collections import defaultdict
lines = defaultdict(lambda : {})
for child in self.cur_item._objs: #<-- changed
if isinstance(child, LTChar):
(_,_,x,y) = child.bbox
line = lines[int(-y)]
line[x] = child._text.encode(self.codec) #<-- changed
for y in sorted(lines.keys()):
line = lines[y]
self.outfp.write(\";\".join(line[x] for x in sorted(line.keys())))
self.outfp.write(\"\\n\")
# ... the following part of the code is a remix of the
# convert() function in the pdfminer/tools/pdf2text module
rsrc = PDFResourceManager()
outfp = StringIO()
device = CsvConverter(rsrc, outfp, codec=\"utf-8\", laparams=LAParams())
# becuase my test documents are utf-8 (note: utf-8 is the default codec)
doc = PDFDocument()
fp = open(filename, \'rb\')
parser = PDFParser(fp)
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize(\'\')
interpreter = PDFPageInterpreter(rsrc, device)
for i, page in enumerate(doc.get_pages()):
outfp.write(\"START PAGE %d\\n\" % i)
if page is not None:
interpreter.process_page(page)
outfp.write(\"END PAGE %d\\n\" % i)
device.close()
fp.close()
return outfp.getvalue()
回答3:
Since none for these solutions support the latest version of PDFMiner I wrote a simple solution that will return text of a pdf using PDFMiner. This will work for those who are getting import errors with process_pdf
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def pdfparser(data):
fp = file(data, \'rb\')
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = \'utf-8\'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.get_pages(fp):
interpreter.process_page(page)
data = retstr.getvalue()
print data
if __name__ == \'__main__\':
pdfparser(sys.argv[1])
See below code that works for Python 3:
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io
def pdfparser(data):
fp = open(data, \'rb\')
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = \'utf-8\'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.get_pages(fp):
interpreter.process_page(page)
data = retstr.getvalue()
print(data)
if __name__ == \'__main__\':
pdfparser(sys.argv[1])
回答4:
pyPDF works fine (assuming that you\'re working with well-formed PDFs). If all you want is the text (with spaces), you can just do:
import pyPdf
pdf = pyPdf.PdfFileReader(open(filename, \"rb\"))
for page in pdf.pages:
print page.extractText()
You can also easily get access to the metadata, image data, and so forth.
A comment in the extractText code notes:
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Whether or not this is a problem depends on what you\'re doing with the text (e.g. if the order doesn\'t matter, it\'s fine, or if the generator adds text to the stream in the order it will be displayed, it\'s fine). I have pyPdf extraction code in daily use, without any problems.
回答5:
Pdftotext An open source program (part of Xpdf) which you could call from python (not what you asked for but might be useful). I\'ve used it with no problems. I think google use it in google desktop.
回答6:
You can also quite easily use pdfminer as a library. You have access to the pdf\'s content model, and can create your own text extraction. I did this to convert pdf contents to semi-colon separated text, using the code below.
The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with \';\' characters.
Using this approach, I was able to extract text from a pdf that no other tool was able to extract content suitable for further parsing from. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.
pdfminer is an invaluable tool for pdf-scraping.
def pdf_to_csv(filename):
from pdflib.page import TextItem, TextConverter
from pdflib.pdfparser import PDFDocument, PDFParser
from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter
class CsvConverter(TextConverter):
def __init__(self, *args, **kwargs):
TextConverter.__init__(self, *args, **kwargs)
def end_page(self, i):
from collections import defaultdict
lines = defaultdict(lambda : {})
for child in self.cur_item.objs:
if isinstance(child, TextItem):
(_,_,x,y) = child.bbox
line = lines[int(-y)]
line[x] = child.text
for y in sorted(lines.keys()):
line = lines[y]
self.outfp.write(\";\".join(line[x] for x in sorted(line.keys())))
self.outfp.write(\"\\n\")
# ... the following part of the code is a remix of the
# convert() function in the pdfminer/tools/pdf2text module
rsrc = PDFResourceManager()
outfp = StringIO()
device = CsvConverter(rsrc, outfp, \"ascii\")
doc = PDFDocument()
fp = open(filename, \'rb\')
parser = PDFParser(doc, fp)
doc.initialize(\'\')
interpreter = PDFPageInterpreter(rsrc, device)
for i, page in enumerate(doc.get_pages()):
outfp.write(\"START PAGE %d\\n\" % i)
interpreter.process_page(page)
outfp.write(\"END PAGE %d\\n\" % i)
device.close()
fp.close()
return outfp.getvalue()
UPDATE:
The code above is written against an old version of the API, see my comment below.
回答7:
slate
is a project that makes it very simple to use PDFMiner from a library:
>>> with open(\'example.pdf\') as f:
... doc = slate.PDF(f)
...
>>> doc
[..., ..., ...]
>>> doc[1]
\'Text from page 2...\'
回答8:
I needed to convert a specific PDF to plain text within a python module. I used PDFMiner 20110515, after reading through their pdf2txt.py tool I wrote this simple snippet:
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
def to_txt(pdf_path):
input_ = file(pdf_path, \'rb\')
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
process_pdf(manager, converter, input_)
return output.getvalue()
回答9:
Repurposing the pdf2txt.py code that comes with pdfminer; you can make a function that will take a path to the pdf; optionally, an outtype (txt|html|xml|tag) and opts like the commandline pdf2txt {\'-o\': \'/path/to/outfile.txt\' ...}. By default, you can call:
convert_pdf(path)
A text file will be created, a sibling on the filesystem to the original pdf.
def convert_pdf(path, outtype=\'txt\', opts={}):
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfdevice import PDFDevice
from pdfminer.cmapdb import CMapDB
outfile = path[:-3] + outtype
outdir = \'/\'.join(path.split(\'/\')[:-1])
debug = 0
# input option
password = \'\'
pagenos = set()
maxpages = 0
# output option
codec = \'utf-8\'
pageno = 1
scale = 1
showpageno = True
laparams = LAParams()
for (k, v) in opts:
if k == \'-d\': debug += 1
elif k == \'-p\': pagenos.update( int(x)-1 for x in v.split(\',\') )
elif k == \'-m\': maxpages = int(v)
elif k == \'-P\': password = v
elif k == \'-o\': outfile = v
elif k == \'-n\': laparams = None
elif k == \'-A\': laparams.all_texts = True
elif k == \'-D\': laparams.writing_mode = v
elif k == \'-M\': laparams.char_margin = float(v)
elif k == \'-L\': laparams.line_margin = float(v)
elif k == \'-W\': laparams.word_margin = float(v)
elif k == \'-O\': outdir = v
elif k == \'-t\': outtype = v
elif k == \'-c\': codec = v
elif k == \'-s\': scale = float(v)
#
CMapDB.debug = debug
PDFResourceManager.debug = debug
PDFDocument.debug = debug
PDFParser.debug = debug
PDFPageInterpreter.debug = debug
PDFDevice.debug = debug
#
rsrcmgr = PDFResourceManager()
if not outtype:
outtype = \'txt\'
if outfile:
if outfile.endswith(\'.htm\') or outfile.endswith(\'.html\'):
outtype = \'html\'
elif outfile.endswith(\'.xml\'):
outtype = \'xml\'
elif outfile.endswith(\'.tag\'):
outtype = \'tag\'
if outfile:
outfp = file(outfile, \'w\')
else:
outfp = sys.stdout
if outtype == \'txt\':
device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
elif outtype == \'xml\':
device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
elif outtype == \'html\':
device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
elif outtype == \'tag\':
device = TagExtractor(rsrcmgr, outfp, codec=codec)
else:
return usage()
fp = file(path, \'rb\')
process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
fp.close()
device.close()
outfp.close()
return
回答10:
Additionally there is PDFTextStream which is a commercial Java library that can also be used from Python.
回答11:
I have used pdftohtml with the \'-xml\' argument, read the result with subprocess.Popen(), that will give you x coord, y coord, width, height, and font, of every \'snippet\' of text in the pdf. I think this is what \'evince\' probably uses too because the same error messages spew out.
If you need to process columnar data, it gets slightly more complicated as you have to invent an algorithm that suits your pdf file. The problem is that the programs that make PDF files don\'t really necessarily lay out the text in any logical format. You can try simple sorting algorithms and it works sometimes, but there can be little \'stragglers\' and \'strays\', pieces of text that don\'t get put in the order you thought they would... so you have to get creative.
It took me about 5 hours to figure out one for the pdf\'s i was working on. But it works pretty good now. Good luck.
回答12:
PDFminer gave me perhaps one line [page 1 of 7...] on every page of a pdf file I tried with it.
The best answer I have so far is pdftoipe, or the c++ code it\'s based on Xpdf.
see my question for what the output of pdftoipe looks like.
回答13:
Found that solution today. Works great for me. Even rendering PDF pages to PNG images. http://www.swftools.org/gfx_tutorial.html