I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them:
IPython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.
Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.
Many thanks!
Edit:
I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
Quonux suggested that PDFMiner might stop parsing after reaching the first EOF marker. The pyPdf error above would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
I had this same problem on Ubuntu, and there is a very simple solution: print the PDF file to a new PDF. On Ubuntu:
Open the PDF file in the (Ubuntu) document viewer.
Go to File.
Go to Print.
Choose "Print to File" and tick the "PDF" option.
To automate the process, use a script that prints all of your PDF files automatically; a Linux shell script also works.
Note that I named the original (problematic) PDF files pdfx.
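The script this answer pointed to is not reproduced here, so the following is not that script. It is a minimal Python 3 sketch of the same idea (rewrite each PDF into a fresh one), assuming Ghostscript is installed and available as gs on the PATH. The function names and the directory layout are made up for illustration.

```python
import glob
import os
import subprocess


def build_gs_command(src, dst):
    """Return the Ghostscript argv that rewrites src into a fresh PDF at dst."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + dst, src]


def reprint_all(in_dir, out_dir):
    """Rewrite every *.pdf found in in_dir; return the list of files written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for src in sorted(glob.glob(os.path.join(in_dir, "*.pdf"))):
        dst = os.path.join(out_dir, os.path.basename(src))
        # check=True raises CalledProcessError if Ghostscript rejects a file
        subprocess.run(build_gs_command(src, dst), check=True)
        written.append(dst)
    return written
```

Calling reprint_all("indir", "fixed") would leave the originals untouched and put the rewritten copies in a separate directory, so you can run PDFMiner on both and compare.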
I got this error as well, even though I was opening the file with fp = open('example','rb')
However, I still got the error the OP described. It turned out that I had a bug in my code: the PDF was still open in another function.
So make sure you don't have the PDF open in memory elsewhere as well.
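One way to rule this out, sketched here for Python 3 with a hypothetical helper name, is to read the file once inside a with block and hand the parser an in-memory copy. The file is then closed before parsing starts, and no other function can move the read position underneath the parser.

```python
import io


def load_pdf_bytes(path):
    """Read the whole PDF and close the file before any parsing starts."""
    with open(path, "rb") as fp:  # binary mode; closed when the block exits
        data = fp.read()
    # BytesIO is seekable and independent of the file on disk
    return io.BytesIO(data)
```

The returned object can be passed anywhere a file object opened with 'rb' would be accepted. For very large PDFs this trades memory for isolation, so it is more of a debugging aid than a permanent fix.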
Interesting problem. I did some research:
The function that parses the PDF (from PDFMiner's source code):
If the problem were with EOF, a different exception would be raised: '''another function from source'''
From the wiki (PDF specs): A PDF file consists primarily of objects, of which there are eight types: Boolean values, numbers, strings, names, arrays, dictionaries, streams, and the null object.
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
I think the problem is that your "damaged" PDF contains more than one root element.
Possible solution:
You can download the source and add print statements wherever the xref objects are retrieved and wherever the parser tries to parse them. That makes it possible to trace everything that happens before the error appears.
PS: I think this is some kind of bug in the product.
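Before patching the PDFMiner source, a cheaper first step is to look at the raw bytes of a failing file. The sketch below (Python 3, hypothetical function name, not part of PDFMiner or pyPdf) counts the markers the two tracebacks complain about. It only does a plain byte search, so treat the numbers as hints rather than a real parse.

```python
def pdf_markers(path):
    """Count a few structural markers in a PDF's raw bytes."""
    with open(path, "rb") as fp:
        data = fp.read()
    last_eof = data.rfind(b"%%EOF")
    return {
        # offset of the "%PDF-" header, or -1 if it is missing
        "header_at": data.find(b"%PDF-"),
        # number of "%%EOF" markers found anywhere in the file
        "eof_count": data.count(b"%%EOF"),
        "startxref_count": data.count(b"startxref"),
        # occurrences of the "/Root" key in plain text
        "root_count": data.count(b"/Root"),
        # bytes following the last "%%EOF", or None if there is no marker
        "bytes_after_last_eof": None if last_eof < 0
        else len(data) - (last_eof + len(b"%%EOF")),
    }
```

Comparing the output for a working PDF and a failing one should show whether the failing files lack /Root entirely, have several of them, or carry extra data after the final %%EOF. The pyPdf check in the question tests whether the last line starts with %%EOF, so trailing data is one thing worth looking for.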
The solution in slate is to open the file with 'rb', i.e. read-binary mode.
Because slate depends on PDFMiner and I had the same problem, this should solve your problem too.
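To see why the mode matters, here is a small illustration written for Python 3 (the question uses Python 2.7, where the details differ by platform). The payload is a made-up stand-in for a PDF, and latin-1 is chosen only so that every byte decodes. Text mode translates line endings, so the content no longer has the same length as the file on disk, and the byte offsets that a PDF's xref table relies on stop matching.

```python
import os
import tempfile

# Made-up stand-in for a PDF: CRLF line endings plus some non-ASCII bytes
payload = b"%PDF-1.4\r\nbinary \xe2\xe3 data\r\n%%EOF\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(payload)
    path = tmp.name

with open(path, "rb") as fp:  # binary mode: bytes exactly as stored
    raw = fp.read()
with open(path, "r", encoding="latin-1") as fp:  # text mode: newlines translated
    text = fp.read()
os.remove(path)

print(len(raw), len(text))  # the text-mode length is shorter
```

Each \r\n pair collapses to a single \n in text mode, so every offset after the first line is shifted.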
An answer above is right. This error appears only on Windows, and the workaround is to replace
with open(path, 'rb')
by
fp = open(path, 'rb')