Find string inside pdf with shell

Posted 2019-05-29 09:15

Question:

Is there a way to check from a shell script whether a PDF file contains a given string? I am looking for something like:

if [search(string,pdf_file)] > 0 then  
   echo "exist"
fi

Answer 1:

As Simon nicely pointed out, you can simply convert the PDF to plain text with pdftotext and then search the result for whatever you are looking for.

Once converted, you can search with grep, a bash regex, or any variation you like. For example, to look for a date of the form YYYY-MM-DD:

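# Read the extracted text line by line; "-" makes pdftotext write to stdout.
while IFS= read -r line; do
    # Match a date such as 2019-05-29 (YYYY-MM-DD).
    if [[ $line =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)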
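
To get closer to the if-style check from the question, the conversion and the search can be wrapped in a small function. This is only a minimal sketch, not part of the original answer: the function name pdf_contains is made up for illustration, and it assumes pdftotext is available on the PATH. grep -F is used so that the search string is taken literally rather than as a regex.

```shell
# pdf_contains STRING FILE   (hypothetical helper name)
# Exit status 0 if STRING occurs literally in the text of FILE, non-zero otherwise.
pdf_contains() {
    local needle=$1 pdf=$2
    # -q: print nothing, report via exit status only
    # -F: treat the needle as a fixed string, not a regex
    # --: stop option parsing, so a needle starting with "-" is safe
    pdftotext -q "$pdf" - | grep -qF -- "$needle"
}
```

It could then be used as: if pdf_contains "some text" infile.pdf; then echo "exist"; fi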


Answer 2:

This approach converts the .pdf files page by page, so occurrences of the search string $query can be pinned down to a specific page.

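# Search every PDF in the current directory for $query, page by page.
# $query must be set beforehand, e.g. query="some text".
for i in *.pdf; do
    # Page count, taken from the "Pages:" line of pdfinfo.
    pagenr=$(pdfinfo "$i" | grep "^Pages:" | grep -o "[0-9][0-9]*")
    fileid="\n$i\n"
    for (( p=1; p<=pagenr; p++ )); do
        # Convert only page $p and search it case-insensitively.
        matches=$(pdftotext -q -f "$p" -l "$p" "$i" - | grep --color=always -in -- "$query")
        if [ -n "$matches" ]; then
            echo -e "${fileid}PAGE: $p"
            echo "$matches"
            fileid=""   # print the file name only once per file
        fi
    done
done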

pdftotext -f $p -l $p restricts the conversion to the single page number $p. grep --color=always keeps the match highlighting even though the output is captured in a variable and only printed later with echo. Setting fileid="" makes sure the file name of the .pdf document is printed only once when it has multiple matches.
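
As a possible variation (not the method of the answer above), the page numbers can also be recovered from a single conversion pass, because pdftotext separates pages in its output with form feed characters. The sketch below assumes that default behaviour (the -nopgbrk option is not used); the function name pages_matching is hypothetical. Note that it does a literal, case-insensitive substring match in awk instead of a grep regex.

```shell
# pages_matching QUERY FILE   (hypothetical helper name)
# Print the number of every page whose text contains QUERY, ignoring case.
pages_matching() {
    local query=$1 pdf=$2
    # The query is passed through the environment so awk does not
    # interpret backslashes in it.
    pdftotext -q "$pdf" - |
        QUERY=$query awk '
            # One record per page: pages are separated by form feeds.
            BEGIN { RS = "\f"; q = tolower(ENVIRON["QUERY"]) }
            # NR is the current record number, i.e. the page number.
            index(tolower($0), q) { print NR }
        '
}
```

Called as pages_matching "$query" some.pdf, it prints one page number per line and runs pdftotext only once per file.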



Answer 3:

Each letter in a PDF document is typically positioned individually, so you first have to convert the .pdf to text, which reduces the content to a simple stream of characters.

I would try this:

grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists

tr replaces line breaks with spaces, so a phrase that wraps across lines can still match. The \+ allows one or more space characters between the words. Finally, grep -q prints no matching lines; it only returns an exit status (0 for a match, 1 for none).
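
The same idea can be sketched without writing \+ between every pair of words: squeeze every run of whitespace in the extracted text down to a single space, then search for the phrase as a fixed string. This is an illustrative variant under stated assumptions, not the answer's own code: the function name pdf_has_phrase is made up, the phrase itself must be written with single spaces, and pdftotext is assumed to be on the PATH.

```shell
# pdf_has_phrase PHRASE FILE   (hypothetical helper name)
# Exit status 0 if PHRASE occurs in FILE, regardless of how the words
# are separated in the extracted text (spaces, tabs, line breaks).
pdf_has_phrase() {
    local phrase=$1 pdf=$2
    pdftotext -q "$pdf" - |
        # Turn every whitespace character into a space and squeeze repeats.
        tr -s '[:space:]' ' ' |
        # Fixed-string search; only the exit status is of interest.
        grep -qF -- "$phrase"
}
```

For example: pdf_has_phrase "a string" some.pdf && echo exists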