I have around 400 or more PDF files that together form a single text. It's like a book, separated page by page. I need to be able to programmatically search for some keywords over the whole text.
So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?
The second one is: what is the best way to do it? Is there already a good program or library out there?
By the way, I'm using PHP and Python, only.
Use pyPdf:
import pyPdf

def getPDFContent(path):
    content = ""
    # Load the PDF into pyPdf
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate over the pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from the page and add it to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (extractText() returns unicode, so use a unicode literal)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for f in filelist:
    print(keyword in getPDFContent(f))
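If you would rather search page by page (for example, to report which page a keyword appears on), here is a minimal sketch using the same pyPdf calls; filelist and keyword are placeholders as above:

import pyPdf

def findKeywordPages(path, keyword):
    # Collect the 1-based numbers of the pages whose extracted text contains the keyword
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    pages = []
    for i in range(pdf.getNumPages()):
        if keyword in pdf.getPage(i).extractText():
            pages.append(i + 1)
    return pages

for f in filelist:
    print("%s: %s" % (f, findKeywordPages(f, keyword)))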
It is faster and much simpler to search the files one by one: you can just loop over them and run the extraction code on each file, and you avoid having to merge 400 PDFs into one huge document first.
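As a concrete sketch, reusing the getPDFContent() helper from the answer above (filelist and keyword are again placeholders, and lower() is just one way to make the match case-insensitive):

# Collect the names of all files whose text contains the keyword
matches = [f for f in filelist if keyword.lower() in getPDFContent(f).lower()]
print("Files containing the keyword: %s" % matches)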