How to search keywords in 400+ PDF files? [duplica

Posted 2019-09-11 04:01

I have 400 or more PDF files that together form a single text. It's like a book split up page by page. I need to be able to programmatically search the whole text for some keywords.

So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?

The second one is: what is the best way to do it? Is there already a good program or library for this?

By the way, I'm only using PHP and Python.

1 answer
Fickle 薄情
#2 · 2019-09-11 04:15

Use PyPdf, as described here.

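import glob

import pyPdf  # legacy library; its maintained successor is pypdf

def getPDFContent(path):
    content = ""
    # Open the file in binary mode and load it into pyPdf
    with open(path, "rb") as fh:
        pdf = pyPdf.PdfFileReader(fh)
        # Iterate over the pages
        for i in range(pdf.getNumPages()):
            # Extract the text of each page and append it to content
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces, then collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

# Placeholder inputs: adjust the file pattern and keyword to your data
filelist = glob.glob("*.pdf")
keyword = "example"

for f in filelist:
    # Print each file name and whether the keyword occurs in it
    print("%s: %s" % (f, keyword in getPDFContent(f)))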

It is faster and much simpler to search the files one by one: you just loop over all of them and run the code above on each file.
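
If you have several keywords, the same file-by-file loop can record which files contain each one. The following is only a minimal sketch, not part of the original answer: the name search_files and its parameters are made up for illustration, matching is a case-insensitive substring match, and the text extractor is passed in as a function so the loop does not depend on any particular PDF library.

```python
import re
from collections import defaultdict


def search_files(paths, keywords, extract_text):
    """Return {keyword: [path, ...]} for the files whose text contains it.

    search_files is a hypothetical helper. extract_text is any callable
    that maps a path to that file's text, for example getPDFContent.
    """
    hits = defaultdict(list)
    # One case-insensitive pattern per keyword; re.escape keeps the
    # keyword literal even if it contains regex metacharacters.
    patterns = {kw: re.compile(re.escape(kw), re.IGNORECASE) for kw in keywords}
    for path in paths:
        # Extract each file's text once and reuse it for every keyword
        text = extract_text(path)
        for kw, pattern in patterns.items():
            if pattern.search(text):
                hits[kw].append(path)
    # Keywords that were never found do not appear in the result
    return dict(hits)
```

With the code above you would call it as search_files(filelist, ["first keyword", "second keyword"], getPDFContent). Because every file is read only once, adding more keywords costs little, and the order of the paths in each result list follows the order of filelist.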
