I have around 400 or more PDF files that together form a single text. It's like a book, separated page by page. I need to be able to programmatically search for some keywords over the whole text.
So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?
The second one is: what is the best way to do it? Is there already a good program or library out there?
By the way, I'm using PHP and Python, only.
Use pyPdf:
import pyPdf

def getPDFContent(path):
    content = ""
    # Load the PDF into pyPdf
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate over the pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from the page and add it to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (extractText() returns unicode, so use a unicode literal)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for f in filelist:
    print(keyword in getPDFContent(f))
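If you would rather search page by page (for example, to report which page a keyword appears on), here is a minimal sketch using the same pyPdf calls; filelist and keyword are placeholders as above:

import pyPdf

def findKeywordPages(path, keyword):
    # Collect the 1-based numbers of the pages whose extracted text contains the keyword
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    pages = []
    for i in range(pdf.getNumPages()):
        if keyword in pdf.getPage(i).extractText():
            pages.append(i + 1)
    return pages

for f in filelist:
    print("%s: %s" % (f, findKeywordPages(f, keyword)))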
It is faster and much simpler to search the files one by one: you can just loop over them and run the extraction code on each file, and you avoid having to merge 400 PDFs into one huge document first.
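As a concrete sketch, reusing the getPDFContent() helper from the answer above (filelist and keyword are again placeholders, and lower() is just one way to make the match case-insensitive):

# Collect the names of all files whose text contains the keyword
matches = [f for f in filelist if keyword.lower() in getPDFContent(f).lower()]
print("Files containing the keyword: %s" % matches)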