Find string inside pdf with shell

Posted 2019-05-29 09:39

Is there a way to check, from a shell script, whether a PDF file contains a given string? I was looking for something like:

if [search(string,pdf_file)] > 0 then  
   echo "exist"
fi

3 answers
小情绪 Triste *
#2 · 2019-05-29 09:51

Each letter within a PDF document is typically positioned individually, so the raw file does not contain the text as a continuous string. You therefore have to convert the .pdf to plain text first, which reduces the content to a simple stream.

I would try this:

grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists

The tr replaces line breaks with spaces, so a phrase that wraps across lines can still match. The \+ allows for one or more space characters between the words. Finally, grep -q prints nothing; it only sets the exit status (0 on a match, 1 otherwise), which is what && tests.
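To get closer to the if/then form asked for in the question, the one-liner above can be wrapped in a small function. This is only a sketch: the name pdf_contains is made up for illustration, and it searches for a fixed string (grep -F) rather than a regular expression, so it expects exactly one space between words.

```shell
# Hypothetical helper: exit status 0 if the text of PDF $2 contains the
# fixed string $1, non-zero otherwise. Nothing is printed.
pdf_contains() {
    # -q silences pdftotext messages; "-" sends the text to stdout.
    # tr joins wrapped lines so a phrase split over two lines still matches.
    pdftotext -q "$2" - | tr '\n' ' ' | grep -qF -- "$1"
}

# Usage, in the shape asked for in the question:
# if pdf_contains "a string" some.pdf; then
#     echo "exist"
# fi
```

Because the function reports its result through the exit status, it can be used directly as the condition of if, with no [ ] test around it.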

孤傲高冷的网名
#3 · 2019-05-29 09:57

As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search for whatever you're looking for.

After conversion, you may use grep, bash regex, or any variation you want:

# Read the converted text line by line and test each line against a date pattern
while IFS= read -r line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)
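The grep variation mentioned above might look like the following sketch, which prints the dates themselves instead of only reporting that one was found. The function name pdf_dates is hypothetical, and -o (print only the matched part) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Hypothetical helper: print every YYYY-MM-DD style date found in the
# text of the PDF given as $1, one per line.
pdf_dates() {
    # Same pattern as the bash regex, written as an extended regex for grep -E
    pdftotext -q "$1" - | grep -Eo '[0-9]{4}(-[0-9]{2}){2}'
}

# Usage:
# pdf_dates infile.pdf
```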
叛逆
#4 · 2019-05-29 10:07

This approach converts the .pdf files page by page, so the occurrences of the search string $query can be located more specifically, down to the page number.

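# Search all PDF files in the current directory for $query, page by page.
# Assumes $query has been set beforehand, e.g. query="some text".
for i in *.pdf; do
    # Page count, taken from the "Pages:" line of the pdfinfo output
    pagenr=$(pdfinfo "$i" | awk '/^Pages:/ {print $2}')
    fileid="\n$i\n"
    for (( p=1; p<=pagenr; p++ )); do
        # Convert only page $p and grep it, keeping the color codes
        matches=$(pdftotext -q -f "$p" -l "$p" "$i" - | grep --color=always -in -- "$query")
        if [ -n "$matches" ]; then
            echo -e "${fileid}PAGE: $p"
            echo "$matches"
            fileid=""
        fi
    done
done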

pdftotext -f $p -l $p limits the range to be converted to the single page with number $p. grep --color=always forces the highlight escape codes to be emitted even though the output is captured in a variable rather than written to a terminal, so the highlighting survives into the subsequent echo. fileid="" just makes sure the file name of the .pdf document is only printed once when several pages match.
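A variation on the same idea, sketched here as an alternative and not as part of the answer above: by default pdftotext separates pages with a form feed character, so a file can be converted once and split into pages by awk instead of running pdftotext once per page. The function name pdf_grep_pages is hypothetical, the match is a case-insensitive fixed-string search rather than a grep pattern, and the approach assumes pdftotext is run without -nopgbrk.

```shell
# Hypothetical helper: print "PAGE:line" for every line of PDF $2 that
# contains the string $1 (case-insensitive, fixed string).
pdf_grep_pages() {
    pdftotext -q "$2" - | awk -v q="$1" '
        # One awk record per page: pages are separated by form feeds
        BEGIN { RS = "\f"; q = tolower(q) }
        {
            n = split($0, lines, "\n")
            for (i = 1; i <= n; i++)
                if (index(tolower(lines[i]), q) > 0)
                    printf "%d:%s\n", NR, lines[i]
        }'
}

# Usage:
# pdf_grep_pages "$query" some.pdf
```

NR is awk's record counter, which here doubles as the page number. Note that awk -v interprets backslash escapes in the value, so a query containing backslashes would need extra care.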
